UBC Theses and Dissertations


Qualitative repository analysis with RepoGrams Rozenberg, Daniel 2015


Full Text

Qualitative Repository Analysis with RepoGrams

by

Daniel Rozenberg

B.Sc., The Open University of Israel, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

August 2015

© Daniel Rozenberg, 2015

Abstract

The availability of open source software projects has created an enormous opportunity for empirical evaluations in software engineering research. However, this availability requires that researchers judiciously select an appropriate set of evaluation targets and properly document this rationale. This selection process is often critical as it can be used to argue for the generalizability of the evaluated tool or method.

To understand the selection criteria that researchers use in their work we systematically read 55 research papers appearing in six major software engineering conferences. Using a grounded theory approach we iteratively developed a codebook and coded these papers along five different dimensions, all of which relate to how the authors select evaluation targets in their work. Our results indicate that most authors relied on qualitative and subjective features to select their evaluation targets.

Building on these results we developed a tool called RepoGrams, which supports researchers in comparing and contrasting source code repositories of multiple software projects and helps them in selecting appropriate evaluation targets for their studies. We describe RepoGrams’s design and implementation, and evaluate it in two user studies with 74 undergraduate students and 14 software engineering researchers who used RepoGrams to understand, compare, and contrast various metrics on source code repositories.
For example, a researcher interested in evaluating a tool might want to show that it is useful for both software projects that are written using a single programming language, as well as ones that are written using dozens of programming languages. RepoGrams allows the researcher to find a set of software projects that are diverse with respect to this metric.

We also evaluate the amount of effort required by researchers to extend RepoGrams for their own research projects in a case study with 2 researchers.

We find that RepoGrams helps software engineering researchers understand and compare characteristics of a project’s source repository and that RepoGrams can be used by non-expert users to investigate project histories. The tool is designed primarily for software engineering researchers who are interested in analyzing and comparing source code repositories across multiple dimensions.

Preface

The work presented in this thesis was conducted in the Software Practices Lab under supervision of Prof. Ivan Beschastnikh.

The user studies described in this thesis were approved by the UBC Behavioural Research Ethics Board under the certificate H14-02474.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication
1 Introduction
    1.1 Research questions
    1.2 Contributions
2 Related work
    2.1 Selection of evaluation targets
    2.2 Literature surveys
    2.3 Software visualizations
3 Project selection approaches in Software Engineering (SE) literature
    3.1 Protocol
    3.2 Codebook
    3.3 Results
4 RepoGrams’s design and implementation
    4.1 Design
        4.1.1 Visual abstractions
        4.1.2 Mapping values into colors with buckets
        4.1.3 Supported interactions
    4.2 Implementation details
    4.3 Implemented metrics
5 Evaluation
    5.1 User study with undergraduate students
        5.1.1 Methodology
        5.1.2 Results
        5.1.3 Summary
    5.2 User study with SE researchers
        5.2.1 Methodology
        5.2.2 Results
        5.2.3 Semi-structured interview
        5.2.4 Summary
    5.3 Estimation of effort involved in adding new metrics
        5.3.1 Summary
6 Future work
    6.1 Additional features
    6.2 Expanded audience
    6.3 Further evaluations
7 Conclusion
Bibliography
A Literature survey
    A.1 Full protocol
        A.1.1 Scope
        A.1.2 Overview
        A.1.3 Procedure
    A.2 Categories
        A.2.1 Notes
    A.3 Raw results
B Undergraduate students study
    B.1 Slides from the in-class demonstration
    B.2 Protocol and questionnaire
        B.2.1 Overview
        B.2.2 Questionnaire
        B.2.3 Demographics
        B.2.4 Warmup questions
        B.2.5 Metric comprehension questions
        B.2.6 Questions about comparisons across projects
        B.2.7 Exploratory question
        B.2.8 Open comments
        B.2.9 Filtering results
    B.3 Raw results
C Software engineering researchers study
    C.1 Protocol and questionnaire
        C.1.1 Procedure overview
        C.1.2 Study protocol
        C.1.3 Questionnaire
        C.1.4 Filtering results
    C.2 Raw results
D Case study
    D.1 Results
        D.1.1 Overview
        D.1.2 Raw results
E License and availability

List of Tables

Table 3.1 SE conferences that we reviewed in our literature survey.
Table 3.2 Selection criteria codes and frequencies from our literature survey.
Table 3.3 Project visibility codes and frequencies from our literature survey.
Table 4.1 Alphabetical list of all metrics included in the current implementation of RepoGrams.
Table 5.1 Main questions from the advanced user study.
Table A.1 Results on the initial set of 59 papers used to seed the codebook.
Table A.2 Results and analysis of the survey of 55 papers.
Table B.1 Raw results from the demographics section in the user study with undergraduate students.
Table B.2 Raw results from the warmup section (questions 1–4) in the user study with undergraduate students.
Table B.3 Raw results from the metrics comprehension section (questions 5–7) in the user study with undergraduate students.
Table B.4 Raw results from the metrics comprehension section (questions 8–10) in the user study with undergraduate students.
Table B.5 Raw results from the project comparison section (questions 11–13) in the user study with undergraduate students.
Table B.6 Raw results from the exploratory question in the user study with undergraduate students.
Table C.1 Raw results from the user study with SE researchers.
Table D.1 Raw results from the case study to estimate the effort involved in the implementation of new metrics.

List of Figures

Figure 1.1 The repository footprint visual abstraction
Figure 1.2 Repository footprints for a section of repository histories of sqlitebrowser and postr projects
Figure 3.1 Frequency of the number of evaluation targets by number of papers
Figure 4.1 RepoGrams interface: (1) input field to add new projects, (2) button to select the metric(s), (3) a repository footprint corresponding to a specific project/metric combination. The color of a commit block represents the value of the metric on that commit, (4) the legend for commit values in the selected metric(s), (5) zoom control, (6) button to switch block length representation and normalization mode, (7) buttons to remove or change the order of repository footprints, (8) way of switching between grouping by metric and grouping by project (see Figure 4.4), (9) Tooltip displaying the exact metric value and the commit message (truncated), (10) metric name and description
Figure 4.2 All six combinations of block length and normalization modes
Figure 4.3 Examples of legends generated from buckets for the Languages in the Commit, Files Modified, and Number of Branches metrics.
Figure 4.4 RepoGrams supports two ways of grouping repository footprints: (a) the metric-grouped view facilitates comparison of different projects along the same metric, and (b) the project-grouped view facilitates comparison of the same project along different metrics.
Figure 5.1 RepoGrams showing the repository footprints as it was during the user study with undergraduate students, question 5.
Figure 5.2 RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 4.
Figure 5.3 RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 5.
Figure 5.4 RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 6.
Figure 5.5 RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 7.
Figure 5.6 RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 8.
Figure 5.7 RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 9.

Glossary

SE      Software Engineering
VCS     Version Control System
CVS     Concurrent Versions System
LoC     Lines of Code
TA      Teaching Assistant
ACM     Association for Computing Machinery
ICSE    International Conference on Software Engineering
MSR     Working Conference on Mining Software Repositories
FSE     International Symposium on the Foundations of Software Engineering
ASE     International Conference on Automated Software Engineering
ESEM    International Symposium on Empirical Software Engineering and Measurement

Acknowledgments

First and foremost I would like to thank my supervisor, Dr. Ivan Beschastnikh, for constantly pushing me forward. I am grateful to Dr. Gail C. Murphy and Dr.
Marc Palyart who have both played a part as important as Ivan’s or mine in this work.

I would also like to thank our colleagues from Saarland University: Fabian Kosmale, Valerie Poser, Heiko Becker, Maike Maas, Sebastian Becking, and Marc Jose for developing the initial versions of RepoGrams.

In addition, I convey my sincerest thanks to the 112 people who participated in and piloted our two user studies.

Finally, no acknowledgment section would be complete without thanking my labmates in both the SPL (Software Practices Labs) and the NSS (Networks, Systems, and Security) lab where I spent most of my time due to the presence of windows which the SPL lacked.

This work was supported by NSERC and by the UBC Computer Science department.

Dedication

SoSwI’vaD vavwI’vaD je. . .
. . . tuqmaj quvmoHjaj QeD ghItlhvam.

To my parents. . .
. . . who support and encourage me in more ways than one.

Chapter 1

Introduction

Software Engineering (SE) researchers are increasingly using open source project information stored in centralized hosting sites, such as GitHub and Bitbucket, to help evaluate their ideas. Researchers who developed a tool or method as part of their research can evaluate it on artifacts from software projects in these centralized repositories, e.g., source code, execution logs, social meta-data. GitHub alone hosts tens of millions of projects, has about 9 million registered users and is one of the top 100 most popular websites in the world [9, 42]. Although recent studies have taken advantage of this enormous availability by using artifacts from hundreds or thousands of projects from such repositories in their evaluations [17, 52], most studies rely on just a handful of evaluation targets that were picked manually by the researchers. We conducted a literature survey of 114 papers from major SE conferences, 65 of which evaluated their research with 8 or fewer evaluation targets, in part because of the detailed analysis needed for the questions being asked.
For these studies the availability of projects has a flip side: although it became easier to access artifacts of software projects for use in evaluations, researchers now need a strategy for selecting the project or projects whose artifacts they will use in the evaluation of their tool or method.

To understand the existing project selection strategies we reviewed in detail 55 recently published SE papers and found that most of them relied on qualitative features of the evaluation targets that were selected. For example, consider two studies published at the Foundations of Software Engineering (FSE) 2012 conference in which the authors selected projects for their evaluation in ad hoc ways. The first study focused on detecting insecure component usage [37]; the authors describe their selection strategy as follows: “we have analyzed [six] applications using widely-used components (such as the IE browser components and the Flash Player)”. The paper does not clarify elsewhere why the authors selected these six specific applications. Because the authors do not report why these applications were selected over any others, readers can only make educated guesses about the metrics of software projects along which this tool’s usefulness generalizes, e.g., whether the tool is useful when applied to software projects with a diverse number of programming languages, different sets of build tools, or varying team sizes.

In another study from FSE 2012, the authors developed new approaches to summarize bug reports [40]. For their selection strategy, the authors relied on prior work and access to an internal IBM product: “We conducted our experiments on two subjects. The first subject . . . was obtained from [34]. . . . The second subject was obtained from IBM DB2 team.” Here, the rationale for selecting the subject DB2 is unclear.
Moreover, it is not clear if the subject DB2 has added value over the subjects from prior work (e.g., whether or not it is substantially different in terms of bug reporting practices).

In our literature survey we found that SE researchers rarely state explicitly whether they employed a strategy to select evaluation targets, with the notable exception of papers that evaluate the same targets as related works. It is therefore likely that researchers find evaluation targets in a haphazard manner and cannot report on their selection process. Based on these findings we believe that SE researchers could benefit from a tool designed to select appropriate software projects for their evaluations. Prior work has proposed new metric-based approaches to selecting diverse projects for research evaluation [45]. We add a different dimension to this perspective by proposing a tool to help support the existing qualitative selection practices in the SE community.

We designed, implemented, and evaluated a new tool, called RepoGrams. RepoGrams supports researchers in comparing and contrasting source code repositories of multiple software projects and helps them in selecting appropriate evaluation targets for their studies. Our evaluations, comprising a literature survey of SE papers, two user studies, and one case study, demonstrate that RepoGrams is useful to SE researchers who are interested in assessing and comparing the development activity of multiple software projects.

[Figure 1.1: The repository footprint visual abstraction. Labels in the figure: Project; Time; blocks A, B, C; Block: commit; Length: commit size; Color: commit metric value.]

A key feature of RepoGrams is its ability to compactly present information about the history of the source code of multiple software projects in a way that is (1) relevant to a researcher’s task, and (2) simplifies comparison. RepoGrams accomplishes this by computing user-selected (or user-implemented) metrics on software repositories.
These metrics use artifacts in the Git repository (e.g., number of Lines of Code (LoC) changed in a commit, the commit author, the branch in which the commit occurred) to compute either scalar or discrete values. RepoGrams then visualizes the metric values using the repository footprint visual abstraction (Figure 1.1). A repository footprint (footprint for short) represents all the individual commits from a software project’s source code repository as a sequence of blocks, one block per commit.

The blocks are laid out left-to-right in temporal order. In Figure 1.1 the commit represented by block A was committed to the repository before the commit of block B, and both were committed before the commit of block C. Other than this strict ordering of the commits, the footprint does not reveal any more information regarding time, i.e., we cannot know how much time passed between the commits represented by blocks A and B based on their position.

A block’s color represents the corresponding commit’s value in the selected metric, and its length is determined by a user-selected block length mode. One example of a block length mode is a linear correspondence between the commit’s size in terms of LoC changed and each block’s length.

The human mind processes visual information differently from verbal information: the visual system provides a high-bandwidth channel that processes information in parallel, whereas verbal information is processed serially [43]. This allows users of the tool to comprehend the presented information faster than they would if the information were represented verbally or numerically.

[Figure 1.2: Repository footprints for a section of repository histories of sqlitebrowser and postr projects. Metric: number of concurrent branches. Legend buckets: 0-1, 2-3, 4-5, 6-7, 8-9, 10-11, 12-13, 14-16.]

For example, Figure 1.2 shows the repository footprints of two projects: a sqlitebrowser [7] footprint (top) and a postr [5] footprint (bottom).
In this example the metric is Number of Branches, a metric that counts the number of active concurrent branches at the time a commit was introduced. From this figure we can make two observations: (1) in contrast to postr, sqlitebrowser progressively uses more concurrent branches over time. We can see this by observing the footprints left to right: the footprint for sqlitebrowser becomes darker and it eventually has as many as 14–16 concurrent branches. In contrast, the footprint for postr does not change its color and remains in the range of 4–5 concurrent branches. (2) sqlitebrowser’s footprint contains commits that are significantly larger than those in postr. We can see this by comparing the length of the commit blocks between the two footprints.

A researcher might want, for example, to study why some development teams change the way they use branches over time. This researcher can use RepoGrams to identify different patterns of branch usage and select projects with repository footprints that exhibit irregular patterns to perform an evaluation on. If most of the projects in this researcher’s scope were to exhibit a steady increase in the number of branches over time, our researcher might choose those projects whose repository footprints exhibit different patterns, e.g., remaining steady over time, alternating between a low and a high number of branches, or remaining at a constant number until the project’s half-point and then increasing steadily.

While designing RepoGrams we were inspired by the promises of mining Git repositories as discussed by Bird et al. [13], such as Promise 2: “Git facilitates recovery of richer project history through commit, branch and merge information in the DAGs of multiple repositories.” We use the abundance of data and meta-data recorded by Git to provide a rich framework for the various metrics that RepoGrams can visualize.
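
As a rough illustration of how a metric could be turned into a repository footprint like those in Figure 1.2, the following is a minimal sketch and not RepoGrams's actual implementation. The `Commit` and `Block` records, the function names, and the `total_width` parameter are hypothetical names introduced here for illustration. The bucket boundaries are copied from the legend of Figure 1.2, and the block length follows the linear LoC-based mode described earlier in this chapter.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Inclusive bucket ranges copied from the legend of Figure 1.2.
BRANCH_BUCKETS: List[Tuple[int, int]] = [
    (0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11), (12, 13), (14, 16),
]


@dataclass
class Commit:
    """Hypothetical commit record; field names are assumptions."""
    timestamp: int            # commit time, used only for ordering
    loc_changed: int          # lines of code changed in the commit
    concurrent_branches: int  # value of the Number of Branches metric


@dataclass
class Block:
    """One block of a footprint: a length and a color bucket index."""
    length: float
    bucket: int


def bucket_index(value: int, buckets: List[Tuple[int, int]] = BRANCH_BUCKETS) -> int:
    """Return the index of the bucket whose inclusive range contains value."""
    for i, (low, high) in enumerate(buckets):
        if low <= value <= high:
            return i
    raise ValueError(f"metric value {value} is outside all buckets")


def footprint(commits: List[Commit], total_width: float = 100.0) -> List[Block]:
    """Lay commits out in temporal order, one block per commit.

    Block length is linear in the commit's LoC changed; the bucket index
    stands in for the block's color.
    """
    ordered = sorted(commits, key=lambda c: c.timestamp)
    total_loc = sum(c.loc_changed for c in ordered) or 1  # avoid division by zero
    return [
        Block(
            length=total_width * c.loc_changed / total_loc,
            bucket=bucket_index(c.concurrent_branches),
        )
        for c in ordered
    ]


# Usage: three commits given out of order; the footprint sorts them by time.
example = [Commit(30, 50, 15), Commit(10, 10, 0), Commit(20, 40, 5)]
print([b.bucket for b in footprint(example)])  # → [0, 2, 7]
```

Only the order of the blocks encodes time in this sketch; the gaps between timestamps are discarded, which matches the footprint's property of not revealing how much time passed between commits.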
We also took the perils they described into account when implementing specific metrics, some of which we detail in Section 4.3.

RepoGrams’s goal is to allow SE researchers to characterize and select a diverse set of evaluation targets, especially when their approach or perspective involves the analysis of project evolution. RepoGrams can help researchers strengthen claims to generality of their evaluated tool or method, and show that their tool does indeed support this diverse set of projects. RepoGrams also allows the research community to check whether the claims of past papers do indeed hold for a diverse set of projects, by submitting the repository URLs of the software projects that were used to evaluate a paper to RepoGrams and inspecting the resulting repository footprints using relevant metrics.

1.1 Research questions

We developed RepoGrams to help SE researchers answer complex questions about repository histories so that they can select evaluation targets. To understand whether RepoGrams serves this purpose, we posed the following research questions:

• RQ1: Can SE researchers use RepoGrams to understand and compare characteristics of a project’s source repository?
• RQ2: Will SE researchers consider using RepoGrams to select evaluation targets for experiments and case studies?

Before approaching the first two research questions we wanted to know if RepoGrams could be used by non-SE experts to investigate project histories, leading us to pose this third research question:

• RQ3: How usable is the RepoGrams visualization and tool?

During a user study we performed with SE researchers, many of them mentioned the need to define custom metrics in a visualization tool such as RepoGrams.
We thus posed the following fourth and final research question:

• RQ4: How much effort is required to add metrics to RepoGrams?

We investigate these questions in Chapter 5.

1.2 Contributions

We make the following contributions:

• We report on a qualitative study of 55 papers from six major SE conferences in 2012–2014 to understand how SE researchers select project-based evaluation targets for their papers. Our results indicate that (1) most papers select projects based on qualitative metrics that are subjective, (2) most papers use 8 or fewer evaluation targets, and (3) there is a need in the SE research community for tool support in understanding project characteristics over time.
• Based on our qualitative study we designed and implemented RepoGrams, a web-based tool to analyze and juxtapose the evolution of one or more software projects. RepoGrams supports SE researchers in exploring the space of possible evaluation targets and is built on an extensible, metrics-based, visualization model that can be adapted to a variety of analyses. RepoGrams is free software, making it easily available for others to deploy and use.
• We evaluated RepoGrams with two user studies. We conducted a user study with 74 fourth-year undergraduate students. We found that RepoGrams can be used by non-SE experts to investigate project histories. In another user study with 14 active SE researchers from the Mining Software Repositories (MSR) community we determined that they can use RepoGrams to understand and compare characteristics of a project’s source repository.
• Finally, we evaluated the effort involved in adding six new metrics to RepoGrams in a case study with two software developers. We found that our study participants learned how to add their first new metric, and later added two more metrics, each in less than an hour.

The rest of this thesis is organized as follows. Chapter 2 lists related work. Chapter 3 describes the literature survey that we conducted.
Chapter 4 discusses the design and implementation of RepoGrams. Chapter 5 describes the evaluation of RepoGrams with two user studies and a case study. Chapter 6 discusses ideas for future work, and finally Chapter 7 concludes the thesis. There are five appendices.

Chapter 2

Related work

In this chapter we present past work related to the work presented in the rest of this thesis. We split these works into categories: Section 2.1 lists works related to the methods that SE researchers employ when selecting evaluation targets for their empirical studies, Section 2.2 presents work related to the literature survey which we discuss in Chapter 3, and finally Section 2.3 lists some of the relevant related work in the vast field of software visualizations.

2.1 Selection of evaluation targets

The problem of helping SE researchers perform better empirical evaluations has been previously considered from three different angles. First, there are ongoing efforts to curate high-quality evaluation targets in the form of bug reports [8, 51], faults [36], source code [60], etc. Such database artifacts promote scientific repeatability and comparison between proposed techniques. Second, researchers have developed tools like GHTorrent [30], Lean GHTorrent [31], Boa [24], MetricMiner [56], and the Orion search engine [15] to simplify access to, filtering, and analysis of projects hosted in large repositories such as GitHub. These tools make it easier to carry out evaluations across a large number of projects. Finally, recent work by Nagappan et al. has proposed improvements for sampling projects [45]. Their notion of sample coverage is a metrics-based approach for selecting and reporting the diversity of a sample set of SE projects.
Unlike these three strands of prior work, we designed RepoGrams to support researchers in better decision making when it comes to selecting evaluation targets.

RepoGrams is designed to assist with research that evaluates software targets. However, SE researchers perform studies spanning the entire life-cycle of the SE process, using other types of data sources. For example, de Mello et al. [21] propose a framework for a more rigorous selection strategy when choosing human subjects for user studies, such as interviews or developer surveys.

2.2 Literature surveys

The work closest in methodology to our literature survey in Chapter 3 (albeit with different goals) is the literature survey by Hemmati et al. [33] which considers 117 research papers from MSR 2004–2012. They perform a grounded theory analysis to identify best practices prevalent in the MSR community across four high-level themes, including data acquisition and replication. Borges et al. [16] reviewed the text and citations of 891 papers from SE venues to discover which mechanisms researchers apply to support their empirical studies.

Collberg et al. [50] explored the repeatability of the research described in 601 papers from Association for Computing Machinery (ACM) conferences and journals, focusing on research that was backed by code. Ghezzi et al. [27] developed SOFAS, a platform devised to provide software analyses and to combine them into analysis workflows, aimed primarily at the MSR community in order to create replicable empirical experiments. To motivate their work they surveyed 88 papers that described empirical studies from the MSR 2004–2011 conferences. Siegmund et al. [55] performed a literature survey of 405 papers from SE conferences to get an overview of the current state of the art regarding discussions about validity and threats to validity in papers which include empirical research.
Our literature review differs in its focus on how and why SE researchers selected the evaluation targets in their studies.

Jedlitschka et al. [35] review past guidelines for reporting study results in SE papers and propose the adoption of a unified standard. Among other matters, they discuss the need to report on the sampling strategy that was used to select the experiment’s subjects. Sampling strategies are the main feature which we extracted in our literature survey.

2.3 Software visualizations

RepoGrams builds on a broad set of prior work on visualization of software evolution [23] and software metrics [41]. The focus of prior visualization work is on novel visualizations to support developers, managers, and other stakeholders in the software development process. The target population in our work on RepoGrams is the SE researcher community. We now overview the most relevant work from this space.

Novel visualizations span a broad range: Revision Towers [59] presents a visual tower that captures file histories and helps to correlate activity between files. GEVOL [18] is a graph-based tool for capturing the detailed structure of a program, correlating it against project activity and showing how it evolves over time. Chronos [54] assists developers in tracking the changes to specific code snippets across the evolution of a program. RelVis [48] visualizes multivariate release history data using a view based on Kiviat diagrams. The evolution matrix [38] supports multiple metrics and shows the evolution of system components over time. Chronia [28] is a tool to visualize the evolution of code ownership in a repository. Spectographs [66] shows how components of a project evolve over time and like RepoGrams extensively uses color. Other approaches to visualizing software evolution are further reviewed in [20]. Other such tools are surveyed by Storey et al. [57] in their work to define a framework for visualization tools that support awareness of human activities in software development.
Some key features that dif-ferentiate RepoGrams from this rich prior work are its focus on commit granularity,avoidance of assumptions about the repository contents, support for comparison ofmultiple projects across multiple metrics, and a highly compact representation.A more recent effort, The Small Project Observatory [39], visualizes ecosys-tems of projects. Complicity [46] is another web-tool that focuses on visualizingvarious metrics on software ecosystems, allowing users to drill down to a singleproject repository or observe the entire ecosystem. The tool’s main view visualizes10ecosystems on a scatter plot, and the users can set various metrics on five differentdimensions: x/y coordinates, width/height or box, and color of box. RepoGramsdiffers from these two works in its emphasis on a single unifying view for all met-rics and a focus on supporting SE researchers.Another set of related work proposes tools to visualize certain software metrics.Some of these metrics are similar to the ones we have implemented in RepoGrams(Table 4.1). In contrast with this prior work our goal with RepoGrams is to supportSE researchers. Additionally, RepoGrams is designed to support multiple metricsand to unify them within a single repository footprint visualization abstraction.ConcernLines [61] is a tool to visualize the temporal pattern of co-occurringsoftware concerns. Similarly to RepoGrams it plots the magnitude of concerns ona timeline and uses color to distinguish high and low concern periods. RepoGramscan be extended with a metric that counts concerns expressed in commit messagesor in code comments. Fractal Figures [19], arguably the tool most visually similarto RepoGrams, visualizes commit authors from software repositories in ConcurrentVersions System (CVS), using either one of two abstractions. 
RepoGrams’s repos-itory footprint abstraction is similar to Fractal Figures’ abstraction called TimeLineView, which assigns a unique color to each commit author and lays all commits ascolored squares on horizontal lines. Similarly to RepoGrams, each horizontal linerepresents a single software repository, and progression from left to right repre-sents the passage of time. RepoGrams includes support for multiple metrics basedon the artifacts exposed by the source repository, and we included a metric thatassigns a unique color per author.Visualization techniques in other domains bear relevance to RepoGrams. Chro-mograms [65] and history flow [63] are technique to visualize editor activity inWikipedia. Chromograms uses color blocks arranged in a sequence but differsin its focus on a single metric to map text into colors. History flow resemblesSeesoft [25] with the additional dimension of time. Begole et al. use actogramsto visualize human computer activity [11]. In actograms a sequence of coloredblocks denote activity over time. However, actograms capture activity over real-time while RepoGrams’s focus is on capturing the sequence, and actograms is notdesigned for studying software activity.11Chapter 3Project selection approaches inSE literatureTo understand how SE researchers select software projects to act as the evaluationtargets for their studies, we conducted a survey of recent papers published in SEvenues. We used a grounded theory approach, which is an inductive methodologyto construct theories through the analysis of data [58]. This chapter is organizedas follows: in Section 3.1 we describe the protocol that we followed during thisliterature survey. In Section 3.2 we describe the final version of the codebookthat we developed, in Section 3.3 we summarize the results. 
The full protocol, codebook, and raw results are listed in Appendix A.

3.1 Protocol

We started by reviewing a subset of research papers from large SE conferences that include research tracks where authors describe tools or methods that operate on software projects. We chose 30 and 29 random research-track papers from ICSE 2014 and MSR 2014, respectively. In reading these papers, we considered the selection strategy employed by the authors of each paper for choosing evaluation targets (if any) based on the text in the paper. We then iteratively derived a codebook (a set of codes, or categories) to characterize these strategies. Further, we considered the number of evaluation targets and which artifacts were studied by these papers.

Table 3.1: SE conferences that we reviewed in our literature survey.

Conference                                                   Year   # papers (of 55)
Foundations of Software Engineering (FSE)                    2012   10
Mining Software Repositories (MSR)                           2012   5
Automated Software Engineering (ASE)                         2013   5
Empirical Software Engineering and Measurement (ESEM)        2013   5
International Conference on Software Engineering (ICSE)      2013   10
Mining Software Repositories (MSR)                           2013   10
Foundations of Software Engineering (FSE)                    2014   10

Next, using this initial codebook, three more researchers from the University of British Columbia (UBC) joined to iterate on the codebook by reading and discussing a random set of 55 research track papers from six SE conference proceedings in the years 2012–2014 (either five or ten papers from each conference). The surveyed conferences are summarized in Table 3.1.

We met several times to discuss between five and ten papers from one or two conferences at each meeting. Our discussions frequently caused us to update the codebook. By the end of each meeting, we derived a consensus on the set of codes for the discussed papers.
As part of the meeting we also re-coded the previously discussed papers when changes to the codebook required doing so.

3.2 Codebook

The final version of the codebook includes the following five dimensions. We coded each paper in each of these five dimensions, with the exception of papers that received the IRR code in the selection criteria dimension, as described below.

• Selection criteria. The type of criteria that the authors used to select projects for evaluation targets in a given paper. The resulting codes from this dimension are summarized in Table 3.2. For example, the code DEV stands for “Some quality of the development practice was required from the selected evaluation targets. This quality does not necessarily have to be a unique feature, it could be something common such as the existence of certain data sets, usage of various aid tools such as an issue trackers, etc.” Hence we applied this code to those papers whose authors explained that the development process of the evaluation targets in the described research exhibited a particular process or used a particular set of tools that the authors deemed necessary for their evaluation.

• Project visibility. Captures the availability of project data, particularly the availability of the project artifacts, that were used in the paper (e.g., whether it is available online as open source, or is restricted to researchers through an industrial partner). The resulting codes from this dimension are summarized in Table 3.3.

• Analyzes features over time. A binary dimension (yes or no) to determine whether the authors analyzed project features over time or whether a single static snapshot of the project data was used.

• Number of evaluation targets. The number of distinct evaluation targets used in the paper’s evaluation. Note that we recorded the number of targets that the authors claim to evaluate, as some targets can be considered to be a single project or many projects.
For example, Android is an operating system with many sub-projects: one paper can evaluate Android as a single target, while another paper can evaluate the many sub-projects in Android.

• Evaluated artifacts keywords. Encodes the artifacts of the evaluation targets used in the paper’s evaluation (e.g., source code, issues in an issue tracking system, runtime logs).

In our study we found multiple cases in which more than one selection criteria code was applicable. For example, some papers relied on two distinct sets of projects (e.g., one set of projects was used as training data for some tool while another set of projects was used for evaluating that same tool). We therefore allowed multiple selection criteria codes for one paper. Table 3.2 lists the selection criteria codes, and the number of times each code was applied in the set of 55 papers that the four co-authors coded.

Table 3.2: Selection criteria codes and frequencies from our literature survey.

Code: QUA    # papers (of 55*): 18
Description: The authors used informal qualities of the evaluation targets in their selection process, e.g., qualities such as age, codebase size, team composition, etc. The qualities are not defined strictly and there is no obvious way to apply a yes/no question that determines whether a new evaluation target would fit the selection criteria.
Example: “We have analyzed applications using widely-used components (such as the IE browser components and the Flash Player) and evaluated how our chosen reference programs and test subjects differ in terms of policy configurations under various workloads. Table 1 gives the detailed information on the analysis of the IE browser components” [37]

Code: DEV    # papers (of 55*): 17
Description: Some quality of the development practice was required from the selected evaluation targets. This quality does not necessarily have to be a unique feature, it could be something common such as the existence of certain data sets, usage of various aid tools such as an issue trackers, etc.
Example: “For this study we extracted the Jira issues from the XML report available on the Apache Software Foundation’s project website for each of the projects.” [49]

Code: REF    # papers (of 55*): 17
Description: References an existing and specific source of evaluation targets, such as another paper that has evaluated a similar technique/tool on a repository.
Example: “We evaluate our technique on the same search gold set used by Shepherd et al.” [68]

Code: DIV    # papers (of 55*): 9
Description: The authors mention diversity, perhaps not by name, as one of the features of the selected evaluation targets.
Example: “In this study, we analyze [. . . ] three software systems [. . . ] [that] belong to different domains” [67]

Code: ACC    # papers (of 55*): 2
Description: The authors had unique access to the evaluation targets, such as a software that is internal to the researching company. Not always explicit but sometimes implied from the text.
Example: “The case organization had been developing a telecommunications software system for over ten years. They had begun their transformation from a waterfall-like plan-driven process to an agile process in 2009.” [32]

Code: MET    # papers (of 55*): 1
Description: Random or manual selection based on a set of well-defined metrics. There is a well defined method to decide whether a new given project would fit the selection criteria. MET can be used for selecting artifacts, but it must also provide constraints that also (perhaps implicitly) select the projects.
Example: “. . . we created a sample of highly discussed pull requests . . . We defined ‘highly discussed’ as pull requests where the number of comments is one standard deviation (6.7) higher than the mean (2.6) in the dataset, filtering out all pull requests with less than 9 comments in the discussion.” [62]

Code: UNK    # papers (of 55*): 2
Description: Papers that do not provide an explanation of the selection process. This code is exclusive, and cannot be applied to the same set of evaluation targets if other codes were applied to that set.

Code: IRR    # papers (of 55*): 15
Description: Papers that are irrelevant to our focus: evaluation does not use projects or does not analyze repository information. This code is exclusive, and cannot be applied to a paper if other codes were applied to that paper.

* Number of papers does not add up to 55 since multiple codes can be applied to each paper.

3.3 Results

The raw results of this literature survey are listed in Section A.3 of the appendix. We proceed to summarize these results.

Among the 55 papers coded by all four researchers, we used a code other than IRR on 40 (73%) papers. In the rest of this report we consider these 40 relevant papers as our global set.

Based on Table 3.2 we find that the three top selection criteria codes (QUA, DEV, and REF) had almost identical frequency at 18, 17, and 17 papers (45%, 43%, and 43% of the 40 relevant papers, respectively). That is, to select their evaluation targets the SE papers we considered relied on (1) qualitative aspects of the projects, (2) particular development practices, and (3) targets from previously published research. We found that 28 papers (70%) were coded with QUA and/or DEV. These two codes show that the majority of authors perform an ad-hoc selection of evaluation targets. When analyzing which artifacts were evaluated we found that 21 (53%) of the 40 papers evaluated the targets’ source code or related artifacts such as patches or code clones.
We propose that a tool that assists authors with the selection process of their evaluation targets should inquire into informal metrics on both source code related artifacts of the projects themselves, and on artifacts relating to the development process of the projects.

Based on Table 3.3 we see that the vast majority of authors, at 36 papers (90%), prefer to run their evaluation on publicly available artifacts, such as the source code of open source projects. Industrial collaborations are a minority at 5 papers (13%).

Table 3.3: Project visibility codes and frequencies from our literature survey.

Code: PUB    # papers (of 55)*: 36
Description: Projects were selected from a publicly available repository. Most likely open source, but not necessarily. Others can download the source code, binary, and/or data and run the evaluation themselves.
Example: “In this work, we analyze clone genealogies containing Type-1, Type-2, and Type-3 clones, extracted from three large open source software systems written in JAVA, i.e., ARGOUML, APACHE-ANT, and JBOSS.” [67]

Code: IND    # papers (of 55)*: 5
Description: Industrial/company project. A collaboration with an industrial partner where the authors use their project, e.g., a company that performs research (e.g., Microsoft Research, Oracle Labs) and uses in-house projects. Can be an explicit mention of industrial partner (“we worked with Microsoft”) or a mention of a proprietary project.
Example: “In our work, we applied four unsupervised approaches [. . . ] to the problem of summarization of bug reports on the dataset used in [34] (SDS) and one internal industrial project dataset (DB2-Bind)” [40]

Code: CON    # papers (of 55)*: 2
Description: Project that the authors have complete control over, e.g., a new project from scratch solely for the purpose of the study, student projects.
Example: “We conducted three different development projects with undergraduate students of different duration and number of participating students.” [22]

Code: UNK    # papers (of 55)*: 1
Description: When no details on the projects’ visibility were given.

Code: IRR    # papers (of 55)*: 15
Description: When the selection criteria is IRR.

* Number of papers does not add up to 55 since multiple codes can be applied to each paper.

There can be several reasons why most authors choose to run their evaluations on publicly available artifacts, e.g., reproducibility of the evaluation, the community’s familiarity with the evaluation target, ease of access to the data. Our tool should therefore focus on assisting authors in filtering potential evaluation targets out of vast repositories of public software projects, such as GitHub.

We also found that 16 papers (40%) analyzed their evaluation targets over time, indicating that many researchers are interested in studying changes over time and not just a snapshot. Our tool should have the capacity to consider temporal information about software projects.

Figure 3.1: Frequency of the number of evaluation targets by number of papers

Finally, considering all 114 total surveyed papers (both in the initial seeding process of the codebook and the joint coding process), we found that 84 papers (74%) performed an empirical evaluation on some artifacts of software projects (i.e., non-IRR), and 63 of these 84 papers (75%) evaluated their work with 8 or fewer evaluation targets. See Figure 3.1 for the frequency of the number of evaluation targets by number of papers (footnote 1). Evaluations of entire large datasets are not as common as evaluations involving only a handful of targets. Our tool should support authors in small scale evaluations.

Footnote 1: For one paper it was unclear whether the authors evaluated 2 or 3 targets. In this chart it was counted as 3. One paper was not included in this chart since it neglected to mention the final number of evaluation targets.

As mentioned in Chapter 2, some tools exist that are designed to assist researchers with the selection of large-scale sets of evaluation targets (i.e., hundreds or even hundreds of thousands), such as GHTorrent [30]. However, we found that the vast majority of SE researchers prefer to evaluate their work on small, manually curated sets. One might wonder why most SE researchers prefer working with small sets of evaluation targets. We found no direct answer to that question during our literature survey. However, having read these 55 papers we came up with the following two hypotheses:

(1) In most papers we found that the authors require an in-depth analysis of each target to answer their research questions. These analyses are often manual processes which would not be possible were they analyzing thousands of evaluation targets. These questions do not gain from quantity but rather from the quality of the analysis. Working with a larger set will often increase the amount of busywork that the authors perform.

(2) Testing for diversity is difficult as its measure depends on the chosen metrics. A set of 8 evaluation targets might be considered diverse according to hundreds of different metrics, and another set of thousands of evaluation targets might be considered to lack diversity according to those same metrics. Hence, adding more evaluation targets to the set will not necessarily increase the SE community’s confidence that the evaluated tool or method satisfies validity or claims to generalizability.
Both of these hypotheses deserve further study, such as a survey of SE researchers.

Overall through this literature study we found that the process by which SE researchers choose their evaluation targets is often haphazard and rarely described in detail. This process could be supported by a tool designed to help SE researchers characterize and select software repositories to use as evaluation targets. Such a tool could also help the broader SE community better understand the authors’ rationale behind a particular choice of evaluation targets.

Chapter 4
RepoGrams’s design and implementation

Based on the results of our literature survey we set out to create RepoGrams, a tool to understand and compare the evolution of multiple software repositories. RepoGrams is primarily intended to assist SE researchers in choosing evaluation targets before they conduct an evaluation of a tool or a method as part of their research projects. RepoGrams has three key features, each of which is grounded in our literature survey. First, RepoGrams is designed to support researchers in project selection. RepoGrams supports comparison of metrics for about a dozen projects (75% of papers evaluated their work with 8 or fewer evaluation targets). Second, it is designed to present multiple metrics side-by-side to help characterize the software development activity in a project overall (70% of papers used informal qualities to characterize their evaluation targets). Third, RepoGrams captures activity in project repositories over time (we found that 40% of papers consider software evolution in their evaluations). In the rest of this chapter we explain RepoGrams’s design and implementation.

4.1 Design

We designed RepoGrams as a client-server web application, due to the convenience of use that such platforms provide to end users.
Figure 4.1 shows a screenshot of a RepoGrams session with three added projects and two selected metrics.

Figure 4.1: RepoGrams interface: (1) input field to add new projects, (2) button to select the metric(s), (3) a repository footprint corresponding to a specific project/metric combination. The color of a commit block represents the value of the metric on that commit, (4) the legend for commit values in the selected metric(s), (5) zoom control, (6) button to switch block length representation and normalization mode, (7) buttons to remove or change the order of repository footprints, (8) way of switching between grouping by metric and grouping by project (see Figure 4.4), (9) tooltip displaying the exact metric value and the commit message (truncated), (10) metric name and description

RepoGrams is designed to support the following workflow: the user starts by importing some number of project repositories. She does this by adding the projects’ Git repository URLs to RepoGrams (1 in Figure 4.1). The server clones (footnote 1) these Git repositories and computes metric values for all the commits across all of the repositories. Next, the user selects one or more metrics (2 in Figure 4.1). This causes the server to transmit the precomputed metric values to the client to display. The metric values are assigned to colors and the interface presents the computed project repository footprints to the user (3 in Figure 4.1) along with the legend for each metric (4 in Figure 4.1).

Footnote 1: In Git nomenclature, “cloning” is the process of copying an entire Git repository with all its history and meta-data to the local machine.

RepoGrams currently requires that researchers manually add repositories that they already know, and base their selection on those. We discuss one idea to overcome this limitation later in this thesis.

4.1.1 Visual abstractions

We designed several visual abstractions to support tasks in RepoGrams:

• Repository footprint.
RepoGrams visualizes one or more metrics over the commits of one or more project repositories as a continuous horizontal line that we call a repository footprint, or footprint for short. (Figure 4.1 shows six repository footprints, two for each of the three project repositories.) The footprints are displayed in a stack to facilitate comparison between projects/metrics. A footprint is composed of a sequence of commit blocks. RepoGrams serializes the commits across all branches of a repository into a footprint using the commits’ timestamps.

• Commit block. Each individual commit in the Git repositories is represented as a single block. The user selects a mode that determines what the width of the block will represent (see next bullet point). The metric value computed for a commit determines the block’s color (see last bullet point).

• Block width. The length of each commit block can be either a constant value, a linear representation of the number of LoC changed in the commit, or a logarithmic representation of the same. We also support two normalization variants:
  – project normalized. All lengths are normalized per project to utilize the full width available in the browser window. This mode prevents meaningful comparison between projects if the user is interested in contrasting absolute commit sizes. The footprints in Figure 4.1 use this mode.
  – globally normalized. Block lengths are resized to be relatively comparable across projects.
  All six possible combinations are demonstrated in Figure 4.2.

• Block color. A commit block’s color is determined by a mapping function that is defined in the metric’s implementation. This process is described in detail in Section 4.1.2.

Figure 4.2: All six combinations of block length and normalization modes (rows: fixed, linear, and logarithmic block length, each shown globally normalized and project normalized).

Each cell contains two repository footprints that represent two artificially generated projects.
The top repository footprint is of a repository with 6 commits in total, having 1, 2, 3, 4, 5, and 6 LoC changed. The bottom repository footprint is of a repository with 5 commits in total, having 1, 2, 4, 8, and 16 LoC changed.

4.1.2 Mapping values into colors with buckets

The computed value for each commit in each metric is mapped to a specific color with a buckets metaphor. For each metric we map several values together into a bucket as described below, and each bucket is assigned a color. Thus, the process of assigning a color for a commit block is to calculate the commit’s value in the metric, match that value to a bucket, and color the commit block based on the bucket’s matching color. The addition of a new repository to the view can cause some buckets to be recalculated, which will cause computed values to be reassigned and commit blocks to be repainted a different color.

A legend of each mapping created by this process is displayed next to each selected metric (4 in Figure 4.1). For example, the second metric shown in Figure 4.1 is author experience. Using this example we can see that the most experienced author in the sqlitebrowser repository committed between 383–437 commits in this repository, as can be seen in the latest commits of that project (left-most commit blocks). In contrast, no author committed more than 218 commits in the html-pipeline repository, and no author committed more than 382 commits in the postr project.

Figure 4.3: Examples of legends generated from buckets for the Languages in the Commit, Files Modified, and Number of Branches metrics.

Figure 4.3 contains examples of three more legends, representing buckets that were generated from three metrics: Languages in the Commit, Files Modified, and Number of Branches. The buckets change automatically to match the repositories that were added by the user. RepoGrams currently supports three types of buckets:

• fixed buckets: the metric has 8 buckets of predefined ranges.
For example, the commit message length metric uses this bucket type. Buckets of this type do not change when new repositories are added. In the case of the commit message length metric the bucket ranges are <[0–1], [2–3], [4–5], [6–8], [9–13], [14–21], [22–34], [35–∞)>. Thus, two commits having 4 and 50 words in their commit messages will be matched to the 3rd and 8th bucket, respectively, and their commit blocks will be colored accordingly.
  An important benefit of fixed buckets is that adding or removing repositories from the view will not change the color of other repositories’ commit blocks. On the other hand, outliers will be bundled together in the highest valued bucket. For example, a commit with a message length of 50 words will be bundled with a commit with a message length of 10,000 words due to the aforementioned fixed ranges.
  The bucket colors for this type are ordered. They follow a linear progression such as increasing brightness on a single hue or a transition between two different hues.

• uniform buckets: the metric has up to 8 buckets of equal or almost equal size, based on the largest computed value for that metric across all of the repositories. The buckets cannot always be of equal size due to integer division rules. For example, the languages in the commit metric uses this bucket type. If the highest value across all repositories is 7 or 12 then the bucket ranges are <{0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}> or <[0–1], [2–3], {4}, [5–6], [7–8], {9}, [10–11], [12–13]>, respectively.
  The distribution of ranges within these uniform buckets changes whenever the maximal value of a commit in the metric changes with the addition or removal of a repository to the view. Since the visualization is not stable, the same color can represent one value at some point and another value after adding a new repository to the view. Whether this is an advantage or a disadvantage is up to the researcher.
A more obvious disadvantage of this bucket type is that outliers can skew the entire visualization towards one extreme. For example, if all commit values are within the range 1–10 except one commit with a value of 1,000, then using uniform buckets will cause all commits except the outlier to be placed in the lowest bucket, colored the same. We discuss potential solutions to this issue in Chapter 6.
  The bucket colors for this type are also ordered. Some metrics use a slightly modified version of this bucket type, such as having a separate bucket just for zero values and 7 equal/almost equal buckets on the remaining values, or starting from 1 instead of 0. These modifications are exposed in the legend.

• discrete buckets: unlike the other two bucket types, which deal with assigning numeric values from a linear or continuous progression of values to buckets, the discrete bucket type deals with discrete values. For example, the commit author metric assigns each unique commit author its own unique bucket.
  The bucket colors for this type are categorical. Each bucket in this type has a completely different color to facilitate differentiation between the discrete values. The number of discriminable colors is relatively small, between six and twelve [43]. A metric that uses this type should limit the number of discrete values to twelve. This is not always possible. For example, a project repository might have hundreds of developers. Without a scheme to bundle these developers together into shared buckets the commit author metric will have to display hundreds of colors.
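The fixed and uniform bucket types described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not RepoGrams’s actual mapper code (which runs on the client side): the function names are hypothetical, and the uniform formula is simply one that reproduces the bucket boundaries quoted above for maximal values of 7 and 12.

```python
from bisect import bisect_right

# Lower bounds of the eight fixed ranges quoted in the text for the
# commit message length metric: [0-1], [2-3], [4-5], [6-8], [9-13],
# [14-21], [22-34], [35-inf).
MESSAGE_LENGTH_LOWER_BOUNDS = [0, 2, 4, 6, 9, 14, 22, 35]


def fixed_bucket(value, lower_bounds=MESSAGE_LENGTH_LOWER_BOUNDS):
    """Return the 0-based index of the fixed bucket containing a non-negative value."""
    return bisect_right(lower_bounds, value) - 1


def uniform_bucket(value, max_value, num_buckets=8):
    """Return the 0-based bucket index of value among buckets spread over 0..max_value.

    Assumed formula (hypothetical): integer division distributes the
    max_value + 1 possible values over num_buckets buckets, so bucket
    sizes are equal or almost equal.
    """
    return min(num_buckets - 1, value * num_buckets // (max_value + 1))


def color_for(bucket_index, colors):
    """Look up the color assigned to a bucket in an ordered list of colors."""
    return colors[bucket_index]
```

Under these assumptions, messages of 4 and 50 words fall into the 3rd and 8th fixed buckets (indices 2 and 7). With values in the range 1–10 plus a single outlier of 1,000, every value except the outlier lands in uniform bucket 0, which is the skew described above. The discrete type is left out of the sketch; it would need one color per distinct value, which runs into the limit on discriminable colors.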
Solutions to this issue are metric-specific.

Figure 4.4: RepoGrams supports two ways of grouping repository footprints: (a) the metric-grouped view facilitates comparison of different projects along the same metric, and (b) the project-grouped view facilitates comparison of the same project along different metrics.

4.1.3 Supported interactions

The RepoGrams interface supports a number of interactions. The user can:

• Scroll the footprints left to right and zoom in and out (5 in Figure 4.1) to focus on either the entire timeline or a specific period in the projects’ histories. In projects with hundreds or thousands of commits, some commit blocks might become too small to see. By allowing the user to zoom and scroll we enable them to drill down and explore the finer details of the visualization.

• Change the block length mapping and normalization mode (6 in Figure 4.1) as described in Section 4.1.1. The different modes emphasize different attributes, such as the number of commits or the relative size of each commit.

• Remove a project or move a repository footprint up or down (7 in Figure 4.1). By rearranging the repository footprints a user can visually derive ad-hoc groupings of the selected project repositories.

• Change the footprint grouping (8 in Figure 4.1) to group footprints by metric or by project (see Figure 4.4). The two modes can help the user focus on either comparisons of metrics within each project, or comparisons of projects within the same metric.

• Hover over or click on an individual commit block in a footprint to see the commit metric value, commit message, and link to the commit’s page on GitHub (9 in Figure 4.1).
This opens a gateway for the user to explore the cause of various values, such as when a user is interested in understanding why a certain commit block has an outlier value in some selected metric.

4.2 Implementation details

RepoGrams is implemented as a client-server web application using a number of open source frameworks and libraries, most notably CherryPy [2] and Pygit2 [6] on the server side and AngularJS [1] on the client side. The server side is implemented mostly in Python, while the client side, as with all contemporary web applications, is implemented in HTML5, CSS3, and JavaScript.

For convenience of deployment, RepoGrams can generate a Docker image that contains itself in a deployable format. Docker is an open platform for distributed applications for developers and system administrators [3] that enables rapid and consistent deployment of complex applications. By easing the deployment process we empower researchers who are interested in extending RepoGrams’s functionality to focus exclusively on their development efforts.

Each metric is implemented in two files. The first file is the server side implementation of the metric in Python. This file declares a single function, the name of which serves as the metric’s machine ID. The function takes one argument, a graph object that represents a Git repository, where each vertex in the graph represents a commit in that repository and contains commit artifacts, e.g., the commit log messages and the commit authors. The graph object has a method to iterate all the commit nodes in temporal order. The function returns an ordered array containing the computed value for each commit in the temporal order of the commits. The second file is the client side implementation of the metric in JavaScript. This file declares meta-data about the metric: its name, description, icon, colors, and which mapper function the metric uses.
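The shape of a server-side metric file described above can be sketched as follows. This is a hedged illustration, not code from RepoGrams: the `Commit` and `CommitGraph` classes and the `iter_commits_in_order` method name are hypothetical stand-ins for the graph object, whose real interface is not shown in the text. Only the overall contract (one function named after the metric, one graph argument, one value per commit in temporal order) comes from the description.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Commit:
    """Hypothetical commit vertex holding a few commit artifacts."""
    timestamp: int
    author: str
    message: str


@dataclass
class CommitGraph:
    """Hypothetical stand-in for the graph object that represents a repository."""
    commits: List[Commit] = field(default_factory=list)

    def iter_commits_in_order(self) -> Iterator[Commit]:
        """Yield all commits in temporal order (assumed method name)."""
        return iter(sorted(self.commits, key=lambda c: c.timestamp))


def commit_message_length(graph: CommitGraph) -> List[int]:
    """Metric function; its name would serve as the metric's machine ID.

    Returns the number of words in each commit message, ordered by
    commit time.
    """
    return [len(commit.message.split()) for commit in graph.iter_commits_in_order()]
```

For a graph holding “Initial commit” followed by “Fix typo in README”, this function returns one word count per commit, oldest first. Only the server-side file is sketched here, not the client-side JavaScript file.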
This client-side file also defines a function to convert the raw computed value to human readable text to display in the tooltip (9 in Figure 4.1).

Some metrics might require a new mapper function. These are defined similarly by adding a single JavaScript file to the mapper directory in the application. A mapper is an object with two functions: updateMappingInfo and map. The function updateMappingInfo takes as its argument an array with all the raw values returned from the server. It then calculates any changes to the buckets and returns true or false to indicate whether the buckets were modified at all. The function map takes as arguments a raw value as calculated by the server and a list of colors from the metric and returns the color that is associated with the equivalent bucket based on the work performed by updateMappingInfo earlier.

For a metric to be included and activated in a deployment of RepoGrams it must be registered in the Python base package file (__init__.py) of the metrics directory. By allowing deployers to modify which metrics are included we can support specific uses for RepoGrams that only require a subset of the existing metrics. Chapter 6 discusses an example of such a case for using RepoGrams as an educational tool.

4.3 Implemented metrics

As of this writing, RepoGrams has twelve built-in metrics. We list them in alphabetical order in Table 4.1 and describe them after the table. The Bucket column lists the type of mapping function used to assign a color to a metric value, as described earlier in Section 4.1.2. The Info column lists the type of information exposed in this metric. The metrics that we have developed so far can be categorized according to the kind of information they surface:

• Artifacts information, e.g., values computed from the source code
• Development process information, e.g., values computed from commit times
• Social information, e.g., who the commit author was

The Dev. column lists who developed this specific metric.
This is either the original development team (Team) that developed RepoGrams’s earlier versions, or one of two developers (Dev1 or Dev2) who developed metrics in a controlled setting to estimate the effort involved in adding a new metric to RepoGrams. This experiment is described in Section 5.3. The LoC column in the table represents the lines of code involved in the server-side calculation of a metric. Client-side code is not counted as it mostly consists of meta-data. The Time column lists the amount of time spent by a developer to add the metric. This was not counted for the original team since the development of these metrics was not conducted in a controlled setting.

Table 4.1: Alphabetical list of all metrics included in the current implementation of RepoGrams.

Name                    Bucket    Info         Dev.  LoC  Time
Author Experience       Uniform   Social       Dev2    8  26 min
Branches Used           Discrete  Development  Team    5  -
Commit Age              Fixed     Development  Dev1    7  48 min
Commit Author           Discrete  Social       Dev1   34  52 min
Commit Localization     Fixed     Artifacts    Team   13  -
Commit Message Length   Fixed     Development  Team    6  -
Files Modified          Fixed     Artifacts    Dev2    6  42 min
Languages in a Commit   Uniform   Artifacts    Team   15  -
Merge Indicator         Uniform   Development  Dev2    5  44 min
Most Edited File        Uniform   Artifacts    Team   11  -
Number of Branches      Uniform   Development  Team   47  -
POM Files               Uniform   Artifacts    Dev1    6  30 min

It should be noted that these metrics are in no way comprehensive, nor are they usable for all researchers and research purposes. RepoGrams’s power comes not from its current set of metrics, but rather from its extensibility (as described in Section 4.2). Half of the metrics listed above were created during the experiment described in Section 5.3.

It is possible that these existing metrics will eventually create a bias among researchers regarding the selection process.
However, as there is currently no tool designed for the same purpose as RepoGrams, we believe that the inclusion of these metrics is an improvement over the current state of the art in which researchers do not base their selection on evidence. Mitigating this potential bias remains a problem for future researchers.

We proceed to describe each metric. For each metric we also provide an example that exhibits why researchers might care about this metric.

• Author Experience. The number of commits a contributor has previously made to the repository. For example, a researcher interested in studying developer seniority across software projects can use this metric to choose projects that exhibit different patterns of author experience, e.g., similar between developers vs. skewed for a minority of developers in the team. This metric was added based on a suggestion by one of the participants in the SE researchers study (see Section 5.2.3).
• Branches Used. Each implicit branch [13] is associated with a unique color. A commit is colored according to the branch it belongs to. For example, a researcher interested in studying whether projects exhibit strong branch ownership by individual developers can correlate this metric with the Commit Author metric.
• Commit Age. Elapsed time between a commit and its parent commit. For merge commits we consider the elapsed time between a commit and its youngest parent commit. For example, a researcher interested in exploring whether a correlation exists between the elapsed time that separates commits, and the likelihood that the latter commit is a bug-introducing commit, can use this metric to select projects that contain different patterns of commit ages.
• Commit Author. Each commit author is associated with a unique color. A commit block is colored according to its author.
For example, a researcher interested in studying the influence of dominant contributors on minor contributors in open source projects can begin their exploration by using this metric to identify projects that exhibit a pattern of one or several dominant contributors.
• Commit Localization. Fraction of the number of unique project directories containing files modified by the commit. A metric value of 1 means that all the modified files in a commit are in a single directory. A metric value of 0 means that all the project directories contain a file modified by the commit. For example, researchers interested in cross-cutting concerns could use this metric to search for projects to study. A project with many commits that have a low value of localization can potentially have a high level of cross-cutting concerns.
• Commit Message Length. The number of words in a commit log message. For example, a researcher interested in finding whether a correlation exists between the commit message lengths and developer experience can compare this metric with the Author Experience metric and select projects that exhibit different patterns for further study.
• Files Modified. The number of files modified in a particular commit, including new and deleted files. For example, a researcher interested in studying project-wide refactoring operations can use this metric to find points in history where a large number of files were modified in a repository. A large number of modified files could potentially indicate that such an event has occurred. This metric was added based on a suggestion by one of the participants in the SE researchers study.
• Languages in a Commit. The number of unique programming languages used in a commit based on filenames. For example, a researcher interested in studying the interaction between languages in different commits can use this metric to identify projects that have many multilingual commits.
• Merge Indicator. Displays the number of parents involved in a commit.
Two or more parents denote a merge commit. For example, a researcher may be interested in studying projects with many merge commits. This metric can reveal whether a project is an appropriate candidate for such a study. This metric was added based on a suggestion by one of the participants in the SE researchers study.
• Most Edited File. The number of times that the most edited file in a commit has been previously modified. For example, a researcher interested in studying “god files” can use this metric to identify projects where a small number of files have been edited multiple times over a short period. The existence of such files potentially indicates the existence of “god files”.
• Number of Branches. The number of branches that are concurrently active at a commit point. For example, a researcher interested in studying how and why development teams change the way they use branches can use this metric to identify different patterns of branch usage for further exploration.
• POM Files. The number of POM files changed in every commit. For example, a researcher interested in exploring the reasons for changes to the parameters of the build scripts of projects can use this metric to find points in history where those changes occurred.

The POM Files metric is an example of a specific case that can be generalized, in this case to highlight edits to files with a user-determined filename pattern. Customizable metrics are discussed as future work in Section 6.1.

Chapter 5
Evaluation

We conducted two user studies and an experiment to answer the research questions we posed earlier in Section 1.1. This chapter describes these evaluations and discusses the results.

For convenience we repeat the research questions here.
For a more detailed discussion of these research questions see Section 1.1:

• RQ1: Can SE researchers use RepoGrams to understand and compare characteristics of a project’s source repository?
• RQ2: Will SE researchers consider using RepoGrams to select evaluation targets for experiments and case studies?
• RQ3: How usable is the RepoGrams visualization and tool?
• RQ4: How much effort is required to add metrics to RepoGrams?

The rest of this chapter is organized as follows: Section 5.1 details a user study with undergraduate students that answers RQ3, Section 5.2 details a user study with SE researchers that answers RQ1 and RQ2, and finally Section 5.3 details a case study that answers RQ4.

5.1 User study with undergraduate students

In this first evaluation, a user study with undergraduate students, we aimed to determine if individuals less experienced with repositories and repository analysis could comprehend the concept of a repository footprint and effectively use RepoGrams (RQ3). The study was conducted in a fourth year software engineering class. A total of 91 students participated and 74 students completed the study. Participation in the study in class was optional. We incentivized participation by raffling off five $25 gift cards for the university’s book store among the participants that completed the study.

5.1.1 Methodology

The study consisted of two parts: a 10 minute lecture demonstrating RepoGrams, and a 40 minute web-based questionnaire. The questionnaire asked the participants to perform tasks with RepoGrams and answer questions about their perception of the information presented by the tool.

The questionnaire had three sections¹: (1) a demographics section to evaluate the participants’ knowledge and experience, (2) four warm-up questions to introduce participants to RepoGrams and (3) ten main questions in three categories:

• Metric comprehension. Six questions to test if participants understood the meanings of various metrics.
• Comparisons across projects.
Three questions to test if the participants could recognize patterns across repository footprints to compare and contrast projects and to find positive or negative correlations between them.
• Exploratory question. One question to test whether participants could translate a high-level question into tasks in RepoGrams.

Before each of the main questions, participants had to change selected metrics and/or block length modes. A detailed explanation of the metrics used in each question was provided.

¹ The full questionnaire is listed in Appendix B.

The questions were posed in the context of 5 repositories selected from 10 random projects from GitHub’s trending projects page that were open source and had up to 1,500 commits. From those 10 projects we systematically attempted permutations of 5 projects² until we found a permutation such that all 5 repository footprints fit the ten main questions from the study. We established ground truth answers for each question. The final set of project repositories in the study had min / median / max commit counts of 581 / 951 / 1,118, respectively.

5.1.2 Results

We received 74 completed questionnaires from the 91 participants. These 74 participants answered a median of 8 of the 10 questions correctly. The median time to complete a metric comprehension question was 1:20 min, comparison across projects was 1:32 min, and the exploratory question was 2:51 min. In total, participants took a median of 14:10 min to answer the main questions. The success with which participants answered questions in relatively short time provides evidence that RepoGrams is usable by a broad population (RQ3).
Interestingly, we found no significant correlation between a participant’s success rate and their industry or Version Control System (VCS) experience.

To provide more insight into how these users fared, we highlight the results for two questions; we do not discuss the results of the other eight questions in detail.

Question 5 is an example of a metric comprehension question that asked: “Using the Languages in a Commit metric and any block length, which project is likely to contain source code written in the most diverse number of different languages?” (94% success rate)

In this task the participants were shown 5 repository footprints, as seen in Figure 5.1. Our ground truth consisted of two answers that were visually similar: (1) a footprint that had one commit block in the 16–18 range (chosen by 48 (72%) of participants), and (2) a footprint that had two commit blocks in the 14–15 range (chosen by 15 (22%) of participants). The remaining three footprints had all their commit blocks in the 5–6 range or lower. The high success rate for this question indicates that the users were able to comprehend the metric presented by RepoGrams and to find patterns and trends based on the repository footprints of projects.

² A 6th project was later taken at random. Its repository footprint was to be removed by the participants at the beginning of the study as part of a task intended to familiarize the participants with the interface.

Figure 5.1: RepoGrams showing the repository footprints as it was during the user study with undergraduate students, question 5.

Question 12 is an example of a comparisons across projects question: “Using the languages in a commit metric and the fixed block length, which two project repositories appear to have the most similar development process with each other?” (81% success rate)

In this question, we asked the participants to explain their choice of the two repositories. We then coded the answers based on the attributes the participants used in their decision.
For each question we created at least two codes: one code indicates that the explanation was focused on the metric values (e.g., “These two projects stick to at most 2 languages at all times in their commits. Sometimes, but rarely, they use 3–4 languages as indicated by the commits”), and the other code indicates that the explanation was focused on the visualization (e.g., “The shading in both projects were very light”). Occasionally an explanation would discuss both the metric values and the visualization (e.g., “It seems that both languages use a small number of languages throughout the timeline, since colors used for those projects are mainly light”), in which case we applied both codes. When another visual or abstract aspect was discussed in the participant’s explanation we created codes to match them.

We found that the participants who discussed the meaning of the metric values had a higher success rate (65%) compared to those participants who relied solely on the visualization (27%). A similar trend is apparent in the other questions where we asked the participants to explain their answer.

5.1.3 Summary

This user study on individuals with less SE training indicates that RepoGrams can be used by a broader population of academics with a computer science background. However, when individuals rely on the visualization without an understanding of the metric underlying the visualization, misinterpretation of the data may occur.

5.2 User study with SE researchers

To investigate the first two research questions (RQ1 and RQ2), we performed a user study with researchers from the SE community. This study incorporated two parts: first, participants used RepoGrams to answer questions about individual projects and comparisons between projects; second, participants were interviewed about RepoGrams.
We recruited participants for the study from a subset of authors from the MSR 2014 conference, as these authors likely performed research involving empirical studies using software projects as evaluation targets, and many have experience with repository information. These authors are the kind of SE researchers that might benefit from a tool such as RepoGrams. Some of the authors forwarded the invitation to their students, whom we included in the study.

We used the results of the previous user study and the comments given by its participants to improve the tool prior to running this user study with SE researchers. For example, in the first study RepoGrams only supported the display of one metric at a time. Participant comments prompted us to add support for displaying multiple metrics. We also realized that some labels and descriptions caused confusion and ambiguity, so we endeavored to clarify their meanings. On the technical side, we found that due to the server load during the study, performance was a recurring complaint. We made significant improvements to make all actions in the tool faster.

The study had 14 participants: 5 faculty, 1 post doc, 6 PhD students, and 2 master’s students. Participants were affiliated with institutions from North America, South America, and Europe. All participants have research experience analyzing the evolution of software projects and/or evaluating tools using artifacts from software projects.

Similarly to the undergraduate study, we raffled off one $100 gift card to incentivize participation. The study was performed in one-on-one sessions with each participant: 5 participants were co-located with the investigator and 9 sessions were performed over video chat.

5.2.1 Methodology

Each session in the study began with a short demonstration of RepoGrams by the investigator, and with gathering demographic information.
A participant then worked through nine questions presented through a web-based questionnaire.³

The first three questions on the questionnaire were aimed at helping a participant understand the user interface and various metrics (5-minute limit for all three questions). Our intent was to ensure each participant gained a similar level of experience with the tool prior to the main questions.

The remaining six questions tested whether a participant could use RepoGrams to find advanced patterns. Questions in this section were of the form “group the repositories into two sets based on a feature”, where the feature was implied by the chosen metric (3–7 minute limit per question). Table 5.1 lists these questions in detail. We then interviewed each participant in a semi-structured interview described in Section 5.2.3.

For the study we chose the top 25 trending projects (pulled on February 3rd, 2015) for each of the ten most popular languages on GitHub [64]. From this set we systematically generated random permutations of 1–9 projects for each question until we found a set of projects such that the set’s repository footprints fit the intended purpose of the questions.
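The selection loop described above can be sketched as follows. This is only an illustration under stated assumptions: the thesis does not give code for this step, the fit of a project set to a question was judged by the investigators, and the select_projects function, the fits_question predicate, and the project names are all hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence


def select_projects(
    candidates: Sequence[str],
    set_size: int,
    fits_question: Callable[[List[str]], bool],
    max_attempts: int = 10_000,
    seed: Optional[int] = None,
) -> Optional[List[str]]:
    """Draw random project sets until one satisfies `fits_question`."""
    rng = random.Random(seed)
    for _ in range(max_attempts):
        chosen = rng.sample(list(candidates), set_size)
        if fits_question(chosen):
            return chosen
    return None  # no fitting set found within the attempt budget


def needs_project_3(chosen: List[str]) -> bool:
    # Hypothetical acceptance rule standing in for the manual judgment
    # of whether the footprints fit the purpose of a question.
    return "project-3" in chosen


# Hypothetical candidate pool, for illustration only.
pool = [f"project-{i}" for i in range(25)]
selection = select_projects(pool, set_size=5, fits_question=needs_project_3, seed=0)
```

Fixing the seed makes the draw reproducible, which would let the selection rationale be documented alongside the resulting set of projects.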
The final set of project repositories in the study had min / median / max commit counts of 128 / 216 / 906, respectively.

³ The full questionnaire is listed in Appendix C.

Table 5.1: Main questions from the advanced user study. The Dist. column is a graphic in the original and is shown here as a placeholder.

# | Question | # repository footprints | Dist.
4 | Which of the following statements is true? There is a general {upwards / constant / downwards} trend to the metric values. | 1 | (graphic)
5 | Categorize the projects into two clusters: (a) projects that use Maven (include .pom files), (b) projects that do not use Maven. | 9 | (graphic)
6 | Categorize the projects into two clusters: (a) projects that used a single master branch before branching off to multiple branches, (b) projects that branched off early in their development. | 5 | (graphic)
7 | Categorize the projects into two clusters: (a) projects that have a correlation between branches and authors, (b) projects that do not exhibit this correlation. | 8 | (graphic)
8 | Categorize the projects into two clusters: (a) projects that have one dominant contributor, based on number of lines of code changed, (b) projects that do not have such a contributor. A dominant contributor is one who committed at least 50% of the LoC changes in the project. | 3 | (graphic)
9 | Same as 8, with number of commits instead of number of lines of code changed. | 3 | (graphic)

5.2.2 Results

To give an overall sense of whether SE researchers were in agreement about the posed questions, we use a graphic in the Dist. column of Table 5.1. In this column, each participant’s answer is represented by a block; blocks of the same color denote identical answers. For example, for question 6, twelve participants chose one answer and two participants each chose a different answer, for a total of three distinct answers to that question.

The Dist. column of Table 5.1 shows widespread agreement amongst the researchers for questions 4 and 5. These questions are largely related to interpreting metrics for a project. This quantitative agreement lends support to the understanding part of RQ1.
More variance in the answers resulted from the remaining questions that target the comparison part of RQ1; these questions required more interpretation of metrics and comparisons amongst projects.

To gain more insight into the SE researchers’ use of RepoGrams, we discuss each of the main questions.

Question 4 asked the participants to recognize a trend in the metric value in a single repository. The majority of participants (12 of 14) managed to recognize the trend almost immediately by observing the visualization.

Figure 5.2: RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 4.

Question 5 asked the participants to identify repositories that have a non-zero value in one metric. The participants considered 9 repository footprints where the metric was POM Files: a value of n indicates that n POM files were modified in a commit. This metric is useful for quickly identifying projects that use the Maven build system [4]. All except one participant agreed on the choice for the nine repositories. This question indicates that RepoGrams is useful in distinguishing repository footprints that contain a common feature, represented by a particular color.

Figure 5.3: RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 5.

Question 6 asked the participants to identify those repositories in which the repository footprints started with a sequence of commit blocks of a particular color. The participants considered 5 repository footprints. The metric was Branches Used: each branch is given a unique color, with a specific color reserved for commits to the master branch. All five footprints contained hundreds of colors.

The existence of a leading sequence of commit blocks of a single color in a Branches Used metric footprint indicates that the project used a single branch at the start of its timeline or that the project was imported from a centralized version control system to Git.
All participants agreed on two of the footprints and all but one agreed on each of the other footprints. This indicates that RepoGrams is useful in finding long sequences of colors, even within footprints that contain hundreds of colors.

Figure 5.4: RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 6.

Question 7 asked the participants to identify those repositories in which the repository footprints for two metrics contained a correspondence between the colors of the same commit block. The participants considered a total of 8 repository footprints, with two metrics for four projects. The two metrics were Commit Author and Branches Used. A match in colors between these two metrics would indicate that committers in the project follow the practice of having one branch per author. This is useful to identify for those studies that consider code ownership or the impact of committer diversity on feature development [14].

In the task the number of colors in a pair of footprints for the same repository ranged from a few (<10) to many (>20). The majority (twelve) of participants agreed on their choices for the first, second, and fourth repository pairs. However, we found that they were about evenly split on the third repository (eight vs. six participants). This indicates that RepoGrams is useful in finding a correlation between repository footprints when the number of colors is low, but it is less effective with many unique colors.

Figure 5.5: RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 7.

Questions 8 and 9 asked the participants to estimate the magnitude of non-continuous regions of discrete values. The participants were relatively split on these results.
We conclude that RepoGrams is not the ideal tool for performing this type of task.

Figure 5.6: RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 8.

Figure 5.7: RepoGrams showing the repository footprints as it was during the user study with SE researchers, question 9.

5.2.3 Semi-structured interview

After the participants finished the main tasks, we conducted a semi-structured interview to discuss their experiences with RepoGrams. We asked 5 questions, and allotted a maximum of 10 minutes for this part. No interview lasted that long. The questions were:

• Do you see RepoGrams being integrated into your research/evaluation process? If so, can you give an example of a research project that you could use/could have used RepoGrams in?
• What are one or two metrics that you wish RepoGrams included that you would find useful in your research? How much time would you be willing to invest in order to write code to integrate a new metric?
• In your opinion, what are the best and worst parts of RepoGrams?
• Choose one of the main tasks that we asked you to perform. How would you have performed it without RepoGrams?
• Do you have any other questions or comments?

Since the interviews were mostly unstructured, participants went back and forth between questions when replying to our questions. Hence, the following summary of all interviews also takes an unstructured form:

Of the 14 participants, 11 noted that they want to use RepoGrams in their future research: “I would use the tool to verify or at least to get some data on my selected projects” [P12]⁴ and “I would use RepoGrams as an exploratory tool to see the characteristics of projects that I want to choose” [P9]. They also shared past research projects in which RepoGrams could have assisted them in making a more informed decision while choosing or analyzing evaluation targets.
The remaining 3 participants said that they do not see themselves using RepoGrams in their research but that either their students or their colleagues might benefit from the tool.

Most participants found the existing metrics useful: “Sometimes I’m looking for active projects that change a lot, so these metrics [e.g., Commit Age] are very useful” [P8]. However, they all suggested new metrics and mentioned that they would invest between 1 hour and 1 week to add their proposed metric to RepoGrams. In Section 5.3 we detail a case study in which we add three of these proposed metrics to RepoGrams and show that this takes less than an hour per metric. The proposed metrics ranged from simple metrics like counting the number of modified files in a commit, to complex metrics that rely on third-party services and tools. For example, two participants wanted to integrate tools to compute the complexity of a change-set based on their own prior works. Another participant wanted to integrate a method to detect the likelihood that a commit is a bug-introducing commit. Yet another participant suggested a metric to calculate the code coverage of the repository’s test suite to consider the evolution of a project’s test suite over time.

A few of the suggestions would require significant changes. For example, inspired by the POM Files metric, two participants suggested a generalized version of this metric that contains a query window to select a file name pattern. The metric would then count the number of files matching the query in each commit. We discuss this idea and others in Chapter 6.

⁴ We use [P1]–[P14] to refer to the anonymous participants.

The participants also found that RepoGrams helped them to identify general historical patterns and to compare projects: “I can use RepoGrams to find general trends in projects” [P3] and “You can find similarities . . . it gives a nice overview for cross-projects comparisons” [P13].
They also noted that RepoGrams would help them make stronger claims in their work: “I think this tool would be useful if we wanted to claim generalizability of the results” [P4].

One of our design goals was to support qualitative analysis of software repositories. However, multiple participants noted that the tool would be more useful if it exposed statistical information: “It would help if I had numeric summaries.” and “When I ask an exact numeric question this tool is terrible for that. For aggregate summaries it’s not good enough” [P6].

Another design limitation that bothered participants is the set temporal ordering of commits in the repository footprint abstraction: “Sometimes I would like to order the commits by values, not by time” [P7] and “I would like to be able to remove the merge commits from the visualizations.” [P14]. Related to this, a few participants noted the limitation that RepoGrams does not capture real time in the sequence of commit blocks: “the interface doesn’t expose how much time has passed between commits, only their order.” [P7].

The participants were asked to choose one of the tasks and explain how they would solve that task without using RepoGrams. Two generalized approaches emerged repeatedly. The most common approach was to write a custom script that clones the repositories and performs the analysis. One participant mentioned that their first script for question 5 (identifying projects that use or have used Maven) would potentially get wrong results since they intended to only observe the latest snapshot and not every commit from the repository.
A software project might have used Maven early in its development and later switched to an alternative build system, in which case its latest snapshot would not contain POM files and the script would fail to recognize this repository.

Alternatively, some participants said that they would import the meta-data of the Git repositories into a spreadsheet application and perform the analysis manually. Some participants mentioned that GitHub exposes some visualizations, such as a histogram of contributors for repositories. These visualizations are per-repository and do not facilitate comparisons.

5.2.4 Summary

This study shows that SE researchers can use RepoGrams to understand characteristics about a project’s source repository and that they can, in a number of cases, use RepoGrams to compare repositories (RQ1), although the researchers noted areas for improvement. Through interviews, we determined that RepoGrams is of immediate use to today’s researchers (RQ2) and that there is a need for custom-defined metrics.

5.3 Estimation of effort involved in adding new metrics

The SE researchers who participated in the user study described in the previous section had a strong interest in adding new metrics to RepoGrams. Because researchers tend to have unique research projects that they are interested in evaluating, it is likely that this interest is true of the broader SE community as well. In this last study we evaluated the effort in adding new metrics to RepoGrams (RQ4).

The metrics were implemented by two junior SE researchers: (Dev1) a master’s student who is the author of this thesis, and (Dev2) a fourth year Computer Science undergraduate student. Dev1 was, at the time, not directly involved in the programming of the tool and was only slightly familiar with the codebase. Dev2 was unfamiliar with the project codebase. Each developer added three new metrics (the six rows marked Dev1 or Dev2 in Table 4.1).

Dev1 added the POM Files, Commit Author, and Commit Age metrics.
Prior to adding these metrics Dev1 spent 30 minutes setting up the environment and exploring the code. The POM Files metric took 30 minutes to implement and required changing 16 LoC (see footnote 5). Dev1 then spent 52 minutes and 48 minutes developing the Commit Author and Commit Age metrics, changing a similar amount of code for each metric.

Dev2 implemented three metrics based on some of the suggestions made by the SE researchers in Section 5.2.3: Files Modified, Merge Indicator, and Author Experience. Prior to adding these metrics Dev2 spent 39 minutes setting up the environment and 40 minutes exploring the code. These metrics took 42, 44, and 26 minutes to implement, respectively. All metrics required changing fewer than 30 LoC.

Footnote 5: Note that these numbers are different from those listed in Table 4.1. See the closing paragraph in Section 5.3.1 for an explanation of this disparity.

5.3.1 Summary

The min / median / max times to implement the six metrics were 26 / 43 / 52 minutes. These values compare favorably with the time that it would take to write a custom script to extract metric values from a repository, an alternative practiced by almost all SE researchers in our user study. The key difference, however, is that by adding the new metric to RepoGrams the researcher gains two advantages: (1) the resulting project pattern for the metric can be juxtaposed against project patterns for all of the other metrics already present in the tool, and (2) the researcher can use all of the existing interaction capabilities in RepoGrams (changing block lengths, zooming, etc.).

At the time of this case study, the architecture of the tool required that developers modify existing source code files in order to add a new metric. While this complicated the process of adding a new metric, the experiment shows that developers can do so in less than 1 hour after an initial code exploration.
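As a quick check, the min / median / max figures reported at the start of this summary can be recomputed from the six per-metric implementation times given in Section 5.3. The sketch below uses Python's standard statistics module; the dictionary name is illustrative, and setup and code-exploration time is excluded, as in the reported figures.

```python
from statistics import median

# Implementation time in minutes for each metric, as reported in Section 5.3.
minutes = {
    "POM Files": 30,          # Dev1
    "Commit Author": 52,      # Dev1
    "Commit Age": 48,         # Dev1
    "Files Modified": 42,     # Dev2
    "Merge Indicator": 44,    # Dev2
    "Author Experience": 26,  # Dev2
}

times = sorted(minutes.values())
# With an even number of values, median() averages the two middle ones (42 and 44).
summary = (min(times), median(times), max(times))
```

With six values the median is the mean of the third and fourth sorted times, which is why 43 does not correspond to any single metric.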
We attempted to streamline this process even further by reworking the architecture of the tool to move the implementation of metrics to separate files, as described in Section 4.2. During this architectural change we had to rewrite parts of the existing metrics. Table 4.1 lists the LoC count after this change. We also added documentation to assist developers in setting up their development environment and created examples that demonstrate how to add new metrics.

Chapter 6
Future work

In this chapter we discuss plans for future work involving RepoGrams. Some of these are in response to current limitations of the tool, while others are new ideas aimed at expanding the reach of this tool beyond its current focus on SE researchers.

6.1 Additional features

Studying populations of projects. RepoGrams requires the user to add one project at a time. We are working to add support for importing random project samples from GitHub. RepoGrams can be integrated with a large database of repositories such as GHTorrent [31]. Users can then use a query language (such as SQL or a unique domain-specific language) to query by attributes that are recorded in the database. For example, a user could select random projects that use a particular programming language, have a particular team size, or show a specific range of activity in a period of time. By randomizing the selection based on strictly defined metrics such as these, SE researchers can make a stronger claim of generalizability in their papers.

Supporting custom metrics. SE researchers in our user study (Section 5.2) wanted more specialized metrics that were, unsurprisingly, related to their research interests. As mentioned in Section 5.2.3, we are working on a solution in which specific metrics can be customized in the front-end.
These metrics will have parameters that users can set, and the server will calculate the resulting values for display. For example, the POM Files metric is a specific case of a more generic metric that counts the number of modified files in a commit that match a specific pattern (e.g., *.pom). We are also considering another solution in which a researcher could write a metric function in Python or a domain-specific language and submit it to the server through the browser. The server would integrate and use this user-defined metric to derive repository footprints. We plan to explore the challenges and benefits of this strategy.

Supporting non-source-code historical information. RepoGrams currently supports Git repositories. However, software projects may have bug trackers, mailing lists, Wikis, and other resources that may be useful to study over time and compare with repository history in a RepoGrams interface. We plan to extend RepoGrams with this information by integrating with the GitHub API, taking into account concerns pointed out in prior work [13].

Robust bucketing of metric values. The uniform bucket sizing currently implemented in RepoGrams has several issues. For example, a single outlier metric value can cause the first bucket to become so large that it includes most other values except the outlier. One solution is to generate buckets based on different distributions and to find outliers and place them in a special bucket. We will try different configurations and algorithms for bucketing, as well as enabling real-time modifications by users, in an attempt to solve this issue and other similar ones.

Supporting collaborations. We added an import/export feature that saves the local state in a file, which can be shared and then loaded into the tool by others. This is a preliminary solution to the problem of sharing data sets and visualizations between different researchers working on the same project. We intend to design and implement a more contemporary solution.
One option is to share a link to the current state instead of sharing files.

6.2 Expanded audience

Educational study. We are exploring, together with other researchers in our department, the option of using RepoGrams for educational purposes. We are designing two experiments that involve integrating RepoGrams in a third-year SE class in which student teams develop a software project for the entire term. In one experiment we are attempting to find correlations between specific repository footprints and final grades given to student teams in previous terms when the class was taught. In the other experiment we will integrate RepoGrams into the periodic evaluations of the student teams by the Teaching Assistants (TAs) during the term. In this experiment we are attempting to discover whether the visualizations shown by the tool help guide the student teams towards a more successful completion of the project and help them better understand the expectations of their TA.

Use of RepoGrams in the industry. RepoGrams is designed for SE researchers. However, it is possible that it can also be used by other target audiences. One example is managers or software developers in industry, who can use RepoGrams to track project activity and potentially gain insights about the development process.

6.3 Further evaluations

Long term benefits of RepoGrams for SE researchers. Our evaluation did not conclusively show that RepoGrams helps SE researchers in selecting their evaluation targets. We plan to use RepoGrams in our own SE research work and to collect anecdotal evidence from other researchers to be able to eventually argue this point conclusively.

Evaluating new features. We added several new features to RepoGrams that we did not evaluate. One example of such a feature is the logarithmic block length mode described in Section 4.1.1.
This block length mode was added after the user study with SE researchers (Section 5.2) and thus was not evaluated.

Another feature that we added to the tool was created while designing the above-mentioned educational study. We found that many student groups perform large-scale refactoring, such as running a code formatter in specific commits, or add large third-party libraries to their repositories. The commit blocks for these commits take up a sizable part of the repository footprint, yet they are of no interest in this study. We implemented a feature to hide individual commits from the view. Evaluating whether it is a useful feature for the SE community remains future work.

We intend to test this new feature and others as part of any future evaluation where they might prove relevant.

Chapter 7
Conclusion

The widespread availability of open source repositories has had a significant impact on SE research. It is now possible for an empirical study to consider hundreds of projects with thousands of commits, hundreds of authors, and millions of lines of code. Unfortunately, more is not necessarily better or easier. To properly select evaluation targets for a research study the researcher must be highly aware of the features of the projects that may influence the results. Our preliminary investigation of 55 published papers indicates that this process is frequently undocumented or haphazard.

To help with this issue we developed RepoGrams, a tool for analyzing and comparing software repositories across multiple dimensions. The key idea is a flexible repository footprint abstraction that can compactly represent a variety of user-defined metrics to help characterize software projects over time. We evaluated RepoGrams in two user studies and found that it helps researchers to answer advanced, open-ended questions about the relative evolution of software projects.
RepoGrams is released as free software [53] and is made available online at http://repograms.net/.

Bibliography

[1] AngularJS — Superheroic JavaScript MVW Framework. https://angularjs.org/. → pages 28
[2] CherryPy — A Minimalist Python Web Framework. http://cherrypy.org/. → pages 28
[3] Docker - Build, Ship, and Run Any App, Anywhere. https://www.docker.com/. → pages 28
[4] Welcome to Apache Maven. http://maven.apache.org/. → pages 41
[5] Flickr uploading tool for GNOME. https://github.com/GNOME/postr. → pages 4
[6] Welcome to pygit2’s documentation. http://www.pygit2.org/. → pages 28
[7] DB Browser for SQLite project. https://github.com/sqlitebrowser/sqlitebrowser. → pages 4
[8] Summarizing Software Artifacts. https://www.cs.ubc.ca/cs-research/software-practices-lab/projects/summarizing-software-artifacts. → pages 8
[9] Alexa. github.com Site Overview. http://www.alexa.com/siteinfo/github.com. [Accessed Apr. 20, 2015]. → pages 1
[10] A. Alipour, A. Hindle, and E. Stroulia. A Contextual Approach Towards More Accurate Duplicate Bug Report Detection. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 183–192, Piscataway, NJ, USA, 2013. IEEE Press. ISBN 978-1-4673-2936-1. URL http://dl.acm.org/citation.cfm?id=2487085.2487123. → pages 66, 67
[11] J. B. Begole, J. C. Tang, R. B. Smith, and N. Yankelovich. Work Rhythms: Analyzing Visualizations of Awareness Histories of Distributed Groups. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, CSCW ’02, pages 334–343, New York, NY, USA, 2002. ACM. ISBN 1-58113-560-2. doi:10.1145/587078.587125. URL http://doi.acm.org/10.1145/587078.587125. → pages 11
[12] N. Bettenburg, M. Nagappan, and A. E. Hassan. Think Locally, Act Globally: Improving Defect and Effort Prediction Models. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR ’12, pages 60–69, Piscataway, NJ, USA, 2012. IEEE Press. ISBN 978-1-4673-1761-0.
URL http://dl.acm.org/citation.cfm?id=2664446.2664455. → pages 64
[13] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. German, and P. Devanbu. The Promises and Perils of Mining Git. In Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories, MSR ’09, pages 1–10, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-3493-0. doi:10.1109/MSR.2009.5069475. URL http://dx.doi.org/10.1109/MSR.2009.5069475. → pages 5, 31, 50
[14] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu. Don’t Touch My Code! Examining the Effects of Ownership on Software Quality. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE ’11, pages 4–14, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0443-6. doi:10.1145/2025113.2025119. URL http://doi.acm.org/10.1145/2025113.2025119. → pages 42
[15] T. F. Bissyandé, F. Thung, D. Lo, L. Jiang, and L. Réveillère. Orion: A Software Project Search Engine with Integrated Diverse Software Artifacts. In Proceedings of the 2013 18th International Conference on Engineering of Complex Computer Systems, ICECCS ’13, pages 242–245, Washington, DC, USA, 2013. IEEE Computer Society. ISBN 978-0-7695-5007-7. doi:10.1109/ICECCS.2013.42. URL http://dx.doi.org/10.1109/ICECCS.2013.42. → pages 8
[16] A. Borges, W. Ferreira, E. Barreiros, A. Almeida, L. Fonseca, E. Teixeira, D. Silva, A. Alencar, and S. Soares. Support Mechanisms to Conduct Empirical Studies in Software Engineering: A Systematic Mapping Study. In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering, EASE ’15, pages 22:1–22:14, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3350-4. doi:10.1145/2745802.2745823. URL http://doi.acm.org/10.1145/2745802.2745823. → pages 9
[17] K. Chen, P. Liu, and Y. Zhang. Achieving Accuracy and Scalability Simultaneously in Detecting Application Clones on Android Markets.
In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pages 175–186, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2756-5. doi:10.1145/2568225.2568286. URL http://doi.acm.org/10.1145/2568225.2568286. → pages 1
[18] C. Collberg, S. Kobourov, J. Nagra, J. Pitts, and K. Wampler. A System for Graph-based Visualization of the Evolution of Software. In Proceedings of the 2003 ACM Symposium on Software Visualization, SoftVis ’03, pages 77–ff, New York, NY, USA, 2003. ACM. ISBN 1-58113-642-0. doi:10.1145/774833.774844. URL http://doi.acm.org/10.1145/774833.774844. → pages 10
[19] M. D’Ambros, M. Lanza, and H. Gall. Fractal Figures: Visualizing Development Effort for CVS Entities. In Proceedings of the 3rd IEEE International Workshop on Visualizing Software for Understanding and Analysis, VISSOFT ’05, pages 16–, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7803-9540-9. doi:10.1109/VISSOF.2005.1684303. URL http://dx.doi.org/10.1109/VISSOF.2005.1684303. → pages 11
[20] M. D’Ambros, H. Gall, M. Lanza, and M. Pinzger. Analysing Software Repositories to Understand Software Evolution. In Software Evolution, pages 37–67. Springer Berlin Heidelberg, 2008. ISBN 978-3-540-76439-7. doi:10.1007/978-3-540-76440-3_3. URL http://dx.doi.org/10.1007/978-3-540-76440-3_3. → pages 10
[21] R. M. de Mello, P. C. da Silva, P. Runeson, and G. H. Travassos. Towards a Framework to Support Large Scale Sampling in Software Engineering Surveys. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM ’14, pages 48:1–48:4, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2774-9. doi:10.1145/2652524.2652567. URL http://doi.acm.org/10.1145/2652524.2652567. → pages 9
[22] A. Delater and B. Paech. Tracing Requirements and Source Code during Software Development: An Empirical Study. In Empirical Software Engineering and Measurement, 2013 ACM / IEEE International Symposium on, pages 25–34. IEEE, Oct 2013. doi:10.1109/ESEM.2013.16.
→ pages 18
[23] S. Diehl. Software Visualization: Visualizing the Structure, Behaviour, and Evolution of Software. Springer, 2010. ISBN 3642079857, 9783642079856. → pages 10
[24] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen. Boa: A Language and Infrastructure for Analyzing Ultra-large-scale Software Repositories. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pages 422–431, Piscataway, NJ, USA, 2013. IEEE Press. ISBN 978-1-4673-3076-3. URL http://dl.acm.org/citation.cfm?id=2486788.2486844. → pages 8
[25] S. G. Eick, J. L. Steffen, and E. E. Sumner, Jr. Seesoft-A Tool for Visualizing Line Oriented Software Statistics. IEEE Trans. Softw. Eng., 18(11):957–968, Nov. 1992. ISSN 0098-5589. doi:10.1109/32.177365. URL http://dx.doi.org/10.1109/32.177365. → pages 11
[26] Free Software Foundation. GNU General Public License, Version 3. https://www.gnu.org/copyleft/gpl.html. → pages 110
[27] G. Ghezzi and H. C. Gall. Replicating Mining Studies with SOFAS. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 363–372, Piscataway, NJ, USA, 2013. IEEE Press. ISBN 978-1-4673-2936-1. URL http://dl.acm.org/citation.cfm?id=2487085.2487152. → pages 9
[28] T. Girba, A. Kuhn, M. Seeberger, and S. Ducasse. How Developers Drive Software Evolution. In Proceedings of the Eighth International Workshop on Principles of Software Evolution, IWPSE ’05, pages 113–122, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2349-8. doi:10.1109/IWPSE.2005.21. URL http://dx.doi.org/10.1109/IWPSE.2005.21. → pages 10
[29] A. Gokhale, V. Ganapathy, and Y. Padmanaban. Inferring Likely Mappings Between APIs. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pages 82–91, Piscataway, NJ, USA, 2013. IEEE Press. ISBN 978-1-4673-3076-3. URL http://dl.acm.org/citation.cfm?id=2486788.2486800. → pages 66
[30] G. Gousios. The GHTorrent Dataset and Tool Suite.
In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 233–236, Piscataway, NJ, USA, 2013. IEEE Press. ISBN 978-1-4673-2936-1. URL http://dl.acm.org/citation.cfm?id=2487085.2487132. → pages 8, 20
[31] G. Gousios, B. Vasilescu, A. Serebrenik, and A. Zaidman. Lean GHTorrent: GitHub Data on Demand. In Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pages 384–387, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2863-0. doi:10.1145/2597073.2597126. URL http://doi.acm.org/10.1145/2597073.2597126. → pages 8, 49
[32] V. T. Heikkila, M. Paasivaara, and C. Lassenius. Scrumbut, but does it matter? A mixed-method study of the planning process of a multi-team scrum organization. In Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on, pages 85–94. IEEE, 2013. → pages 16
[33] H. Hemmati, S. Nadi, O. Baysal, O. Kononenko, W. Wang, R. Holmes, and M. W. Godfrey. The MSR Cookbook: Mining a Decade of Research. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 343–352, Piscataway, NJ, USA, 2013. IEEE Press. ISBN 978-1-4673-2936-1. URL http://dl.acm.org/citation.cfm?id=2487085.2487150. → pages 9
[34] C. Iacob and R. Harrison. Retrieving and Analyzing Mobile Apps Feature Requests from Online Reviews. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 41–44, Piscataway, NJ, USA, 2013. IEEE Press. ISBN 978-1-4673-2936-1. URL http://dl.acm.org/citation.cfm?id=2487085.2487094. → pages 67
[35] A. Jedlitschka and D. Pfahl. Reporting guidelines for controlled experiments in software engineering. In Empirical Software Engineering, 2005. 2005 International Symposium on, pages 10–pp. IEEE, Nov 2005. doi:10.1109/ISESE.2005.1541818. → pages 9
[36] R. Just, D. Jalali, and M. D. Ernst. Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs.
In Proceedings of the 2014 International Symposium on Software Testing and Analysis, ISSTA 2014, pages 437–440, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2645-2. doi:10.1145/2610384.2628055. URL http://doi.acm.org/10.1145/2610384.2628055. → pages 8
[37] T. Kwon and Z. Su. Detecting and Analyzing Insecure Component Usage. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE ’12, pages 5:1–5:11, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1614-9. doi:10.1145/2393596.2393599. URL http://doi.acm.org/10.1145/2393596.2393599. → pages 2, 15
[38] M. Lanza. The Evolution Matrix: Recovering Software Evolution Using Software Visualization Techniques. In Proceedings of the 4th International Workshop on Principles of Software Evolution, IWPSE ’01, pages 37–42, New York, NY, USA, 2001. ACM. ISBN 1-58113-508-4. doi:10.1145/602461.602467. URL http://doi.acm.org/10.1145/602461.602467. → pages 10
[39] M. Lungu, M. Lanza, T. Gîrba, and R. Robbes. The Small Project Observatory: Visualizing Software Ecosystems. Sci. Comput. Program., 75(4):264–275, Apr. 2010. ISSN 0167-6423. doi:10.1016/j.scico.2009.09.004. URL http://dx.doi.org/10.1016/j.scico.2009.09.004. → pages 10
[40] S. Mani, R. Catherine, V. S. Sinha, and A. Dubey. AUSUM: Approach for Unsupervised Bug Report Summarization. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE ’12, pages 11:1–11:11, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1614-9. doi:10.1145/2393596.2393607. URL http://doi.acm.org/10.1145/2393596.2393607. → pages 2, 18
[41] T. Mens and S. Demeyer. Future Trends in Software Evolution Metrics. In Proceedings of the 4th International Workshop on Principles of Software Evolution, IWPSE ’01, pages 83–86, New York, NY, USA, 2001. ACM. ISBN 1-58113-508-4. doi:10.1145/602461.602476. URL http://doi.acm.org/10.1145/602461.602476. → pages 10
[42] C. Metz.
How GitHub Conquered Google, Microsoft, and Everyone Else. http://www.wired.com/2015/03/github-conquered-google-microsoft-everyone-else/. → pages 1
[43] T. Munzner. Visualization Analysis and Design. CRC Press, 2014. → pages 4, 26
[44] S. Nadi, C. Dietrich, R. Tartler, R. C. Holt, and D. Lohmann. Linux Variability Anomalies: What Causes Them and How Do They Get Fixed? In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 111–120, Piscataway, NJ, USA, 2013. IEEE Press. ISBN 978-1-4673-2936-1. URL http://dl.acm.org/citation.cfm?id=2487085.2487112. → pages 66
[45] M. Nagappan, T. Zimmermann, and C. Bird. Diversity in Software Engineering Research. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 466–476, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2237-9. doi:10.1145/2491411.2491415. URL http://doi.acm.org/10.1145/2491411.2491415. → pages 2, 8
[46] S. Neu. Telling Evolutionary Stories with Complicity. PhD thesis, Citeseer, 2011. → pages 10
[47] R. Nokhbeh Zaeem and S. Khurshid. Test Input Generation Using Dynamic Programming. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE ’12, pages 34:1–34:11, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1614-9. doi:10.1145/2393596.2393635. URL http://doi.acm.org/10.1145/2393596.2393635. → pages 66, 67
[48] M. Pinzger, H. Gall, M. Fischer, and M. Lanza. Visualizing Multiple Evolution Metrics. In Proceedings of the 2005 ACM Symposium on Software Visualization, SoftVis ’05, pages 67–75, New York, NY, USA, 2005. ACM. ISBN 1-59593-073-6. doi:10.1145/1056018.1056027. URL http://doi.acm.org/10.1145/1056018.1056027. → pages 10
[49] D. Posnett, P. Devanbu, and V. Filkov. MIC check: a correlation tactic for ESE data. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, pages 22–31. IEEE Press, 2012. → pages 15, 66
[50] T. Proebsting and A. M. Warren.
Repeatability and Benefaction in Computer Systems Research. 2015. → pages 9
[51] S. Rastkar, G. C. Murphy, and G. Murray. Summarizing Software Artifacts: A Case Study of Bug Reports. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10, pages 505–514, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-719-6. doi:10.1145/1806799.1806872. URL http://doi.acm.org/10.1145/1806799.1806872. → pages 8
[52] B. Ray, D. Posnett, V. Filkov, and P. Devanbu. A Large Scale Study of Programming Languages and Code Quality in Github. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 155–165, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3056-5. doi:10.1145/2635868.2635922. URL http://doi.acm.org/10.1145/2635868.2635922. → pages 1
[53] D. Rozenberg, V. Poser, H. Becker, F. Kosmale, S. Becking, S. Grant, M. Maas, M. Jose, and I. Beschastnikh. RepoGrams. https://github.com/RepoGrams/RepoGrams. → pages 52, 110
[54] F. Servant and J. A. Jones. History Slicing: Assisting Code-evolution Tasks. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE ’12, pages 43:1–43:11, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1614-9. doi:10.1145/2393596.2393646. URL http://doi.acm.org/10.1145/2393596.2393646. → pages 10
[55] J. Siegmund, N. Siegmund, and S. Apel. Views on internal and external validity in empirical software engineering. In Proceedings of the 37th International Conference on Software Engineering, ICSE 2015, 2015. → pages 9
[56] F. Sokol, M. Finavaro Aniche, and M. Gerosa. MetricMiner: Supporting researchers in mining software repositories. In Source Code Analysis and Manipulation (SCAM), 2013 IEEE 13th International Working Conference on, pages 142–146, Sept 2013. doi:10.1109/SCAM.2013.6648195. → pages 8
[57] M.-A. D. Storey, D. Čubranić, and D. M. German.
On the Use of Visualization to Support Awareness of Human Activities in Software Development: A Survey and a Framework. In Proceedings of the 2005 ACM Symposium on Software Visualization, SoftVis ’05, pages 193–202, New York, NY, USA, 2005. ACM. ISBN 1-59593-073-6. doi:10.1145/1056018.1056045. URL http://doi.acm.org/10.1145/1056018.1056045. → pages 10
[58] A. Strauss and J. Corbin. Basics of qualitative research: Techniques and procedures for developing grounded theory. September 1998. → pages 12
[59] C. M. B. Taylor and M. Munro. Revision Towers. In Proceedings of the 1st International Workshop on Visualizing Software for Understanding and Analysis, VISSOFT ’02, pages 43–50, Washington, DC, USA, 2002. IEEE Computer Society. ISBN 0-7695-1662-9. URL http://dl.acm.org/citation.cfm?id=832270.833810. → pages 10
[60] E. Tempero, C. Anslow, J. Dietrich, T. Han, J. Li, M. Lumpe, H. Melton, and J. Noble. The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies. In Proceedings of the 2010 Asia Pacific Software Engineering Conference, APSEC ’10, pages 336–345, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4266-9. doi:10.1109/APSEC.2010.46. URL http://dx.doi.org/10.1109/APSEC.2010.46. → pages 8
[61] C. Treude and M.-A. Storey. Work Item Tagging: Communicating Concerns in Collaborative Software Development. IEEE Trans. Softw. Eng., 38(1):19–34, Jan. 2012. ISSN 0098-5589. doi:10.1109/TSE.2010.91. URL http://dx.doi.org/10.1109/TSE.2010.91. → pages 11
[62] J. Tsay, L. Dabbish, and J. Herbsleb. Let’s Talk About It: Evaluating Contributions Through Discussion in GitHub. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 144–154, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3056-5. doi:10.1145/2635868.2635882. URL http://doi.acm.org/10.1145/2635868.2635882. → pages 16
[63] F. B. Viégas, M. Wattenberg, and K. Dave. Studying Cooperation and Conflict Between Authors with History Flow Visualizations.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’04, pages 575–582, New York, NY, USA, 2004. ACM. ISBN 1-58113-702-8. doi:10.1145/985692.985765. URL http://doi.acm.org/10.1145/985692.985765. → pages 11
[64] J. Warner. Top 100 Most Popular Languages on Github. https://jaxbot.me/articles/github-most-popular-languages, July 2014. → pages 39
[65] M. Wattenberg, F. B. Viégas, and K. Hollenbach. Visualizing Activity on Wikipedia with Chromograms. In Proceedings of the 11th IFIP TC 13 International Conference on Human-computer Interaction - Volume Part II, INTERACT’07, pages 272–287. Springer-Verlag, Berlin, Heidelberg, 2007. ISBN 3-540-74799-0, 978-3-540-74799-4. URL http://dl.acm.org/citation.cfm?id=1778331.1778361. → pages 11
[66] J. Wu, R. C. Holt, and A. E. Hassan. Exploring Software Evolution Using Spectrographs. In Proceedings of the 11th Working Conference on Reverse Engineering, WCRE ’04, pages 80–89, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7695-2243-2. URL http://dl.acm.org/citation.cfm?id=1038267.1039040. → pages 10
[67] S. Xie, F. Khomh, and Y. Zou. An Empirical Study of the Fault-proneness of Clone Mutation and Clone Migration. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 149–158, Piscataway, NJ, USA, 2013. IEEE Press. ISBN 978-1-4673-2936-1. URL http://dl.acm.org/citation.cfm?id=2487085.2487118. → pages 16, 18, 66
[68] J. Yang and L. Tan. Inferring semantically related words from software context. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, pages 161–170. IEEE Press, 2012. → pages 15

Appendix A
Literature survey

This appendix contains meta-data and raw results for the literature survey described in Chapter 3.

A.1 Full protocol

This is version 6 of the protocol, which we evolved during the coding process.

A.1.1 Scope

Our study considers a paper to be in scope if it describes evaluation targets that match our definition.

A.1.2 Overview

1. Categorize each assigned paper along 5 dimensions.
2. Along the way, you may need to expand the codebook to accommodate previously unobserved cases.

The 5 dimensions:
  (a) code: selection criteria
  (b) code: projects visibility
  (c) yes/no: does the paper analyze some feature of the projects over time
  (d) keywords: data used in the evaluation
  (e) number: number of evaluation targets

A.1.3 Procedure

Read the abstract. Usually the abstract mentions whether the paper evaluates a tool (on rare occasions it will not be mentioned in the abstract but will be in the introduction or conclusion).

Scan the paper to find the section describing the evaluation (usually titled Evaluation or Methodology, but it can have another name). Once you have found the name(s) of the project(s) (see footnote 1) that are being evaluated, search for all mentions of those names and look for a paragraph that explains the reasons for selecting those projects. Usually it will contain the key phrases “we selected X because Y” or “our reasons for selecting X are Y”.

Familiarize yourself with all the codes and apply the one that matches best. Some papers can have two or more selection criteria codes or project visibility codes apply to them. Reasons for that might be:
• The selection criteria are ambiguous
• There are two sets of projects (e.g., creating a cross-project prediction model for software defects)
• It is clear that the selection process had all of the codes apply
• Example: “All datasets used in our case study have been obtained from the PROMISE repository, and have been reported to be as diverse of datasets as can be found in this database” [12] (codes: REF and DIV)

Some papers clearly do not evaluate software, e.g., papers that review or critique previous papers, papers that only conduct a series of interviews, etc. In this case apply IRR for selection criteria only.

Some papers have a detailed explanation of their selection criteria in the threats to validity section.
Make sure to read this section as well.

Footnote 1: Some papers do not mention the project by name (e.g., an IND paper that does not reveal the industrial partner), in which case they would usually give the project a pseudonym or call it “our case study”, “the studied program”, etc.

A.2 Categories

The following codes apply solely to the main evaluation(s) of the paper. They do not include preliminary works.

• Selection criteria codes are listed in Table 3.2.
  Disambiguation:
  – DEV requires that the selected project(s) have a specific development process, either followed by developers or related to some automated tool. The development process is mentioned explicitly as one of the reasons that this project was chosen or as a requirement for the tool to operate. This does not necessarily have to be a unique feature. It could be something common, such as the existence of certain data sets, usage of various aid tools that relate to each other such as an issue tracker that integrates with version control, etc.
  – QUA and MET differ in strictness. They both require that the selection criteria are somewhat indexable: codebase size, age, programming languages, team composition, popularity, program domain, etc. The difference is that QUA is not well-defined: there is no “function” that, given a project, returns a yes/no answer to whether or not this project fits the selection criteria. MET is more deterministic: either a project fits the criteria, or it does not.
  – DIV is blurry: it may be difficult to tell if the authors are characterizing the projects they selected or if they used diversity as a criterion during project selection. Therefore, consider whether diversity is mentioned in the vicinity of the selection methodology and whether it is likely that it was a selection criterion.
  – ACC is not always added when an industrial project is studied. When there is no clear reason, other than the fact that they had access, the ACC code should not be used.
If an equivalent analysis was applied to an industrial and an open source project then ACC should not be used. Note that the IND visibility code (see Table 3.3) can still apply, even if ACC is not used.
• Project visibility codes are listed in Table 3.3.
• Analyzes some features of the project over time (evolution)
– Yes: some aspect of the analysis studies a feature of the projects over time, e.g., comparing two or more releases, reviewing commit logs, inspecting bug types over time.
Example: [49] uses the projects’ commit logs and bug history in its evaluation.
– No: all aspects of the analysis make use of a single snapshot of each project, e.g., running a tool on one version of the project’s source code, comparing bug types across projects but not across time.
Example: [47] uses the projects’ source code from a single snapshot in its evaluation.
• Evaluated artifacts keywords
Write in keywords that describe the evaluation targets’ artifacts used in the paper. Examples:
– [29]: “runtime traces”
– [44]: “patches”
– [67]: “code clones”
– [10]: “bug reports”
• What is a valid evaluation target
A software project. For example, a codebase that evolves over time with multiple collaborators. Not, for example, an abstract model or an algorithm.
• Number of evaluation targets
The number of evaluation targets that the paper uses. This is a subjective number, as some targets can be thought of as 1 project or many projects (e.g., Android is an operating system with many sub-projects: one paper can evaluate Android as a single target, while another paper can evaluate the many sub-projects in Android).
Use the following rules of thumb, which are ordered in decreasing precedence (initial ones take precedence):
1. If a number is explicitly mentioned, use that number.
Example of 161 targets: “Out of the 169 apps randomly selected, 8 apps had no reviews assigned to them which left us with 161 reviewed apps” [34]
2.
For multi-project targets, look for whether a multi-project is evaluated as a single target or as multiple targets.
Example of 8 targets: [47] names 3 targets (“Microbenchmarks”, “Google Chrome”, and “Apple Safari”), but in various tables and in the text the Microbenchmarks are evaluated as 6 discrete targets. The total number is therefore 8: 6 microbenchmarks + 2 named applications.
3. If a number is not explicitly mentioned but the authors list names of projects and treat each project as a single target in their evaluation, count the names.
Example of 1 target: “We evaluate our approach on a large bug-report data-set from the Android project, which is a Linux-based operating system with several sub-projects” [10]

A.2.1 Notes

Multiple selection codes may indicate a number of scenarios. For example, a paper might have selected two sets of projects independently (e.g., ACC for industrial Microsoft projects and REF for open source projects based on prior work). The two selection codes may also indicate a kind of filtering (e.g., REF for selecting benchmarks from prior work and QUA to filter these benchmarks down to a subset used in the paper).

A.3 Raw results

Here we list the raw results from the literature survey.

Table A.1: Results on the initial set of 59 papers used to seed the codebook.

Title | Selection code [1] | # of evaluation targets

MSR 2014
Mining energy-greedy API usage patterns in Android apps: an empirical study | UNK | 55
GreenMiner: a hardware based mining software repositories software energy consumption framework | SPE | 1
Mining questions about software energy consumption | IRR |
Prediction and ranking of co-change candidates for clones | QUA | 6
Incremental origin analysis of source code files | QUA | 7
Oops! where did that code snippet come from? | SPE | 1
Works for me! characterizing non-reproducible bug reports | QUA | 6
Characterizing and predicting blocking bugs in open source projects | QUA | 6
An empirical study of dormant bugs | SPE | 20
The promises and perils of mining GitHub | IRR |
Mining StackOverflow to turn the IDE into a self-confident programming prompter | CON | 2
Mining questions asked by web developers | IRR |
Process mining multiple repositories for software defect resolution from control and organizational perspective | SPE | 1
MUX: algorithm selection for software model checkers | REF | 79
Improving the effectiveness of test suite through mining historical data | IND | 1
Finding patterns in static analysis alerts: improving actionable alert ranking | QUA | 3
Impact analysis of change requests on source code based on interaction and commit histories | QUA | 1
An empirical study of just-in-time defect prediction using cross-project models | REF, QUA | 11
Towards building a universal defect prediction model | POP, REF | 1403
The impact of code review coverage and code review participation on software quality: a case study of the qt, VTK, and ITK projects | MET | 3
Modern code reviews in open-source projects: which problems do they fix | QUA | 2
Thesaurus-based automatic query expansion for interface-driven code search | REF | 100
Estimating development effort in Free/Open source software projects by mining software repositories: a case study of OpenStack | QUA | 1
An industrial case study of automatically identifying performance regression-causes | IND, REF | 2
Revisiting Android reuse studies in the context of code obfuscation and library usages | POP | 24379
Syntax errors just aren't natural: improving error reporting with language models | UNK | 3
Do developers feel emotions? an exploratory analysis of emotions in software artifacts | REF | 117
How does a typical tutorial for mobile development look like? | IRR |
Unsupervised discovery of intentional process models from event logs | SPE | 1

ICSE 2014
Cowboys, ankle sprains, and keepers of quality: how is video game development different from software development? | IRR |
Analyze this! 145 questions for data scientists in software engineering | IRR |
The dimensions of software engineering success | IRR |
How do professionals perceive legacy systems and software modernization? | IRR |
SimRT: an automated framework to support regression testing for data races | QUA | 5
Performance regression testing target prioritization via performance risk analysis | QUA | 3
Code coverage for suite evaluation by developers | MET | 1254
Time pressure: a controlled experiment of test case development and requirements review | IRR |
Verifying component and connector models against crosscutting structural views | UNK | 4
TradeMaker: automated dynamic analysis of synthesized tradespaces | REF, CON | 4
Lifting model transformations to product lines | IRR |
Automated goal operationalisation based on interpolation and SAT solving | IRR |
Mining configuration constraints: static analyses and empirical results | QUA | 4
Which configuration option should I change? | QUA | 8
Detecting differences across multiple instances of code clones | UNK | 3
Achieving accuracy and scalability simultaneously in detecting application clones on Android markets | POP | 150145
Two's company, three's a crowd: a case study of crowdsourcing software development | DES, IND | 1
Does latitude hurt while longitude kills? geographical and temporal separation in a large scale software development project | DES, IND | 1
Software engineering at the speed of light: how developers stay current using twitter | IRR |
Building it together: synchronous development in OSS | REF, QUA, SPE | 31
A critical review of "automatic patch generation learned from human-written patches": essay on the problem statement and the evaluation of automatic software rep | IRR |
Data-guided repair of selection statements | QUA | 7
The strength of random search on automated program repair | REF | 7
MintHint: automated synthesis of repair hints | MET | 3
Mining behavior models from user-intensive web applications | IND | 1
Reviser: efficiently updating IDE-/IFDS-based data-flow analyses in response to incremental program changes | DES, UNK | 4
Automated design of self-adaptive software with control-theoretical formal guarantees | QUA | 3
Perturbation analysis of stochastic systems with empirical distribution parameters | IRR |
How do centralized and distributed version control systems impact software changes? | MET | 132
Transition from centralized to decentralized version control systems: a case study on reasons, barriers, and outcomes | IRR |

• SPE (“SPEcial development process required”) was renamed to DEV (“some quality of the DEVelopment practice required”)
• CON and IND, which were originally “selection process” codes, were used to create the “project visibility” category
• POP (“A complete set or random subset of projects from an explicit population of repositories (such as GitHub, an app store, etc.)”) and MET (“random or manual selection based on a set of well-defined METrics”) were removed from the final version of the codebook as no papers in the main literature survey were categorized using these codes.
An extended literature survey might reveal such papers, in which case these codes can be re-added to the codebook.
• DES (“Evaluated on a project that the tool is designed for, or a case study performed on specific projects (no tool)”) was removed and replaced by other rationales where appropriate

Table A.2: Results and analysis of the survey of 55 papers.

Title | Selection code | Visibility code | Analyzes evolution | Evaluated data type keywords | # of evaluation targets

ICSE 2013
Robust reconfigurations of component assemblies | IRR
Coupling software architecture and human architecture for collaboration-aware s | IRR
Inferring likely mappings between APIs | QUA | PUB | No | runtime traces | 21
Creating a shared understanding of testing culture on a social coding site | IRR
Human performance regression testing | QUA | PUB | No | User performance times | 1
Teaching and learning programming and software engineering via interactive ga | IRR
UML in practice | IRR
Agility at scale: economic governance, measured improvement, and disciplined | IRR
Reducing human effort and improving quality in peer code reviews using automa | ACC,DEV | IND | No | review requests, commits | 2 or 3
Improving feature location practice with multi-faceted interactive exploration | REF,QUA | PUB | No | source code, features | 1

MSR 2013
Which work-item updates need your response? | DEV | PUB,IND | Yes | work items | 2
Linux variability anomalies: what causes them and how do they get fixed? | DEV | PUB | Yes | Patches | 1
An empirical study of the fault-proneness of clone mutation and clone migration | QUA,DIV | PUB | Yes | code clones | 3
A contextual approach towards more accurate duplicate bug report detection | DEV,QUA | PUB | Yes | bug reports | 1
Why so complicated? simple term filtering and weighting for location-based bug | DEV,QUA | PUB | No | bug reports, source code, commits | 2
The impact of tangled code changes | DEV,QUA | PUB | No | bug reports, commits | 5
Replicating mining studies with SOFAS | IRR
Bug report assignee recommendation using activity profiles | QUA,DIV | PUB | No | bug reports | 3
Bug resolution catalysts: identifying essential non-committers from bug repositori | DEV,DIV | PUB,IND | No | bug reports, commits | 16
Discovering, reporting, and fixing performance bugs | DEV,REF | PUB | No | bug reports, patches | 3

FSE 2014
Verifying CTL-live properties of infinite state models using an SMT solver | IRR
Let's talk about it: evaluating contributions through discussion in GitHub | REF,MET | PUB | Yes | pull requests, comments | ?
Detecting energy bugs and hotspots in mobile apps | DIV | PUB | No | executables | 30
Selection and presentation practices for code example summarization | DEV | PUB | No | code fragments | 1
Vector abstraction and concretization for scalable detection of refactorings | REF,QUA | PUB | Yes | source code, commits | 203
Focus-shifting patterns of OSS developers and their congruence with call graphs | QUA | PUB | Yes | commits | 15
Building call graphs for embedded client-side code in dynamic web applications | REF | PUB | No | source code | 5
JSAI: a static analysis platform for JavaScript | REF,DIV | PUB | No | source code | 28
Sherlock: scalable deadlock detection for concurrent programs | REF,DIV | PUB | No | source code | 22
Sketches and diagrams in practice | IRR

FSE 2012
Detecting and analyzing insecure component usage | QUA | PUB | No | components, security policies | 6
Do crosscutting concerns cause modularity problems? | DEV,QUA | PUB | Yes | bug reports, patches, reviews | 1
AUSUM: approach for unsupervised bug report summarization | REF,UNK | PUB,IND | No | bug reports | 2
Test input generation using dynamic programming | REF,QUA | PUB | No | source code | 8
Mining the execution history of a software system to infer the best time for its ad | UNK | UNK | No | event log | 1
CarFast: achieving higher statement coverage faster | IRR
Multi-layered approach for recovering links between bug reports and fixes | REF | PUB | Yes | bug reports, commits | 3
Understanding myths and realities of test-suite evolution | DEV,QUA | PUB | Yes | test suites | 6
Searching connected API subgraph via text phrases | IRR
Rubicon: bounded verification of web applications | DEV,QUA,DIV | PUB | No | source code, specs | 5

MSR 2012
MIC check: A correlation tactic for ESE data | DEV | PUB | Yes | bug reports, commit logs | 4
Think locally, act globally: Improving defect and effort prediction models | REF,DIV | PUB | Yes | defects | 4
Analysis of customer satisfaction survey data | IRR
Inferring semantically related words from software context | REF | PUB | No | source code | 7
A qualitative study on performance bugs | DEV,QUA | PUB | No | bug reports | 2

ASE 2013
Improving efficiency of dynamic analysis with dynamic dependence summaries | REF | PUB | Yes | source code | 6
Bita: Coverage-guided, automatic testing of actor programs | REF,QUA | PUB,CON | No | source code | 8
Ranger: Parallel analysis of alloy models by range partitioning | IRR
JFlow: Practical refactorings for flow-based parallelism | REF,DIV | PUB | No | source code | 7
SEDGE: Symbolic example data generation for dataflow programs | REF,QUA | PUB | No | source code | 31

ESEM 2013
Tracing Requirements and Source Code during Software Development: An Emp | DEV | CON | Yes | requirements, work items, source cod | 3
When a Patch Goes Bad: Exploring the Properties of Vulnerability-Contributing | REF,DEV | PUB | Yes | commits, source code, vulnerabilities | 1
ScrumBut, But Does it Matter? A Mixed-Method Study of the Planning Process o | DEV,ACC | IND | Yes | requirements | 1
Using Ensembles for Web Effort Estimation | IRR
Experimental Comparison of Two Safety Analysis Methods and Its Replication | IRR

Appendix B
Undergraduate students study

This appendix contains meta-data and raw results for the user study with undergraduate students described in Section 5.1.

B.1 Slides from the in-class demonstration

Slide 1
A tool to analyze and juxtapose software project history
University of British Columbia, Computer Science, Software Practices Lab
Saarland University, Computer Science
Brief lecture and in-class research study

Slide footer (repeated on each of the following slides): University of British Columbia, Ivan Beschastnikh

Slide 2: Another tool to visualize repositories?
• There are numerous tools to visualize repositories
• None provide a flexible interface to
• Juxtapose/compare multiple repositories
• Unify multiple metrics of a repository into one view
• Simple and easy to use
• Our targeted population: SE researchers
• Need to select evaluation targets for studies
• Need a simple and efficient project comparison tool

Slide 3: Repograms: a repository is a sequence of blocks
• A block represents a commit
• Block’s length is either a fixed constant or encodes lines of code changed
• A block’s colour represents a “metric” value
• A metric is a function: m(commit) → number
• Example (block length = fixed constant):
[Figure labels: Time, First commit, …, Project]

Slide 4: Repograms: a repository is a sequence of blocks
• A block represents a commit
• Block’s length is either a fixed constant or encodes lines of code changed
• A block’s colour represents a “metric” value
• A metric is a function: m(commit) → number
• Example (block length = lines of code changed):
[Figure labels: Time, First commit, Big commit, …, Project]

Slide 5: Repograms: a repository is a sequence of blocks
• A block represents a commit
• Block’s length is either a fixed constant or encodes lines of code changed
• A block’s colour represents a “metric” value
• Example metric: “number of words in a commit message”
[Figure labels: Time, Short message, Longer message, …]

Slide 6: DEMO!
• Basic tool features

Slide 7: Evaluating repograms
• User-study
• Study design: survey + tool use (you’ll experience this firsthand!)
• Human subjects review

Slide 8: Human subjects review (REB)
• REB: research ethics board
• Independent body
• Reviews research protocol + study materials
• Goal is to maximize human safety: protect human subjects from physical or psychological harm
• Risk-benefit analysis

Slide 9: Repograms REB application
• PDFs

Slide 10: In-class research study
Help us evaluate [RepoGrams logo]!
• You are the subjects
• Voluntary (you don’t have to participate)
• Can do the study at home (anytime this week)
• Enter raffle for 5 x $25 gift cards to UBC Bookstore
Begin study by browsing to: http://repograms.net
Questions? Raise your hand.
Browser options: Chrome (best), Firefox, or Safari (worst)
Does not work in IE
Avoid tablets and phones

B.2 Protocol and questionnaire

B.2.1 Overview

The study was conducted in a 4th year software engineering class at the University of British Columbia. Prior to the study there were two lectures by the lead investigator. The first lecture covered concepts in version control systems and research methods. The second lecture was an introduction to RepoGrams.
The slides from the latter lecture are presented in the preceding section.

Participation was voluntary; we emphasized this before beginning the study. There were 105 students in the classroom, 91 students began the questionnaire and 74 completed it.

When opening RepoGrams during this study it automatically loaded the following 5 repositories¹:
• https://github.com/RepoGrams/sqlitebrowser
• https://github.com/RepoGrams/vim.js
• https://github.com/RepoGrams/AudioStreamer
• https://github.com/RepoGrams/LightTable
• https://github.com/RepoGrams/html-pipeline

¹ In the study we loaded the original repositories from which these repositories were forked. We forked and froze these repositories post-study for reproducibility reasons.

B.2.2 Questionnaire

[Questions are elaborated in the next sections]

We mark our ground truth answers with an underline. For some questions there is more than one correct answer.

A. Consent form
B. Demographics (1 page, 5 questions)
C. Warmup questions (2 pages, 4 questions)
D. Metric comprehension questions (5 pages, 6 questions)
E. Questions about comparisons across projects (3 pages, 3 questions)
F. Exploratory question (1 page, 1 question)
G. Open comments (1 page, 2 questions)
H. Raffle ticket

B.2.3 Demographics

a. How many Computer Science courses have you successfully passed?
• 0–5 courses
• 6–10 courses
• 11–15 courses
• 16 or more courses
b. Have you worked in Computer Science related jobs (e.g. co-ops, internships)? If so, how many academic terms have you spent working in such jobs?
• I have not worked in CS related jobs
• 1 term
• 2 terms
• 3 terms
• 4 or more terms
c. How many years of experience have you had with version control systems? (e.g., mercurial, git, svn, cvs)
• No experience
• 1 year of experience
• 2 years of experience
• 3 years of experience
• 4 or more years of experience
d. How often do you use version control systems?
Consider any use of version control systems, whether it is during courses, jobs, or for private use.
• Daily
• Weekly
• Monthly
• Never use/used
e.
Have you ever used version control outside of class, for a project not related to schoolwork?
• Yes
• No
f. Note: Our online questionnaire included another question in the Demographics section after e. However, we found that the question was understood ambiguously by a large number of participants. As a consequence we removed that question from our analysis and do not report on it here.

B.2.4 Warmup questions

(pg. 1)
• We developed a tool called RepoGrams for comparing and analyzing the history of multiple software projects. In this study you will help us evaluate RepoGrams.
• RepoGrams can be used to analyze various repository metrics, such as the number of programming languages that a project consists of over time.
• A repository consists of a number of commits. RepoGrams represents a commit as a block. The blocks can be configured to have different block lengths.
• A metric measures something about a commit. RepoGrams calculates a numeric metric value for each commit in a project repository.
• RepoGrams uses colours to represent different commit metric values. The legend details which colour corresponds to each metric value, or range of values.

Open RepoGrams in a separate tab: http://repograms.net:2000/

In RepoGrams, a row of blocks represents the software repository history of a software project. A coloured block in one such row represents a single commit to a project repository.

1. How many projects are listed in the newly opened RepoGrams tab?
• 1
• 2
• 3
• 4
• 5
2. Using the Fixed block length, estimate the number of commits in the AudioStreamer project using the RepoGrams tool:
• Tens (10–99)
• Hundreds (100–999)
• Thousands (1,000–9,999)
• Tens of thousands (10,000–99,999)

(pg. 2)
How to use RepoGrams
• Hover your mouse pointer over the block that represents the commit (desktop/laptop) or touch the commit (tablets) to see the commit’s unique identifier (called commit hash) and the commit message.
Clicking/touching the commit will copy the first 5 letters/numbers of the commit’s hash to your clipboard for easy pasting into the answer sheet.
• When asked to identify one or more commits, type the first 5 letters/numbers of their commit hash into the answer sheet.
• When asked to identify one or more projects, project repositories, or metrics, type their name(s) into the answer sheet.
3. Remove the AudioStreamer project repository from the view. How many projects are now listed?
• 1
• 2
• 3
• 4
• 5
4. Add the Postr repository (URL provided below) to the view. Using any metric and the Lines changed (comparable btw. projects) block length and the zoom control, how does the Postr project compare with the other projects in terms of its size? (Options below are listed from smallest to largest)
https://github.com/RepoGrams/postr
• Postr is the smallest project
• Postr is the second smallest project
• Postr is the second largest project
• Postr is the largest project

B.2.5 Metric comprehension questions

(pg. 1)
Languages in a Commit metric
The Languages in a Commit metric measures the number of different programming languages used in each commit. For example, a commit that changed one Java file and two XML files would get the value 2 because it changed a Java file and XML files. A commit that changed 100 Java files would get the value 1 because it only changed Java files.
5. Using the Languages in a Commit metric and any block length, which project is likely to contain source code written in the most diverse number of different languages?
• sqlitebrowser
• vim.js
• LightTable
• html-pipeline
• postr

(pg. 2)
Branches Used metric
The Branches Used metric assigns a colour to each commit based on the branch that the commit belongs to (each branch in a project is given a unique colour).
6.
Using the Branches Used metric and the Lines changed (incomparable btw. projects) block length, which project repository is most likely to have the least number of distinct branches?
• sqlitebrowser
• vim.js
• LightTable
• html-pipeline
• postr
7. Using the Branches Used metric and the Lines changed (incomparable btw. projects) block length, which project repository is most likely to have the most number of distinct branches?
• sqlitebrowser
• vim.js
• LightTable
• html-pipeline
• postr

(pg. 3)
Most Edited File metric
The Most Edited File metric measures the number of times that the most edited file in a commit has been previously modified. A commit with a high metric value indicates that it modifies a file that was changed many times in previous commits. A commit will have a low value if it is composed of new or rarely edited files.
8. Using the Most Edited File metric and the Fixed block length, what is the commit hash of the latest commit that modified the most popular file(s) in the Postr project?
[text field]
[ground truth answer: bed3257]

(pg. 4)
Commit Localization metric
The Commit Localization metric represents the fraction of the number of unique project directories containing files modified by the commit.
A metric value of 1 means that all the modified files in a commit are in a single directory. A metric value of 0 means that all the project directories contain a file modified by the commit.
9. Using the Commit Localization metric and the Fixed block length, which project had the longest uninterrupted sequence of commits with metric values in the range 0.88–1.00?
• sqlitebrowser
• vim.js
• LightTable
• html-pipeline
• postr

(pg. 5)
Number of Branches metric
The Number of Branches metric measures the number of branches actively used by developers: a high value means that developers were making changes in many different branches at the time that a commit was created.
10.
Using the Number of Branches metric and the Fixed block length, find one commit in LightTable when developers were using the largest number of concurrent branches.
There can be multiple correct answers to this question.
[text field]
[ground truth answer: any of 522d848, 7729887, c55f44e]

B.2.6 Questions about comparisons across projects

In this set of questions you will be asked to compare different projects based on one or two metrics. For each of the following questions, explain the reason for your choices in 1–2 short sentences.

(pg. 1)
11. Using the Number of Branches metric and the Lines changed (incomparable btw. projects) block length, which project appears to have a development process most similar to LightTable?
• sqlitebrowser
• vim.js
• LightTable
• html-pipeline
• postr
a. Why did you choose this project?
[text field]

(pg. 2)
12. Using the Languages in a Commit metric and the Fixed block length, which two project repositories appear to have the most similar development process with each other?
• sqlitebrowser
• vim.js
• LightTable
• html-pipeline
• postr
a. Why did you choose this project?
[text field]

(pg. 3)
Commit Message Length
The Commit Message Length metric counts the number of words in the commit log message of each commit.
13. Using the Commit Message Length metric and the Fixed block length, which project has a distinct pattern in the level of detail of its commit messages that distinguishes it from all of the other projects?
• sqlitebrowser
• vim.js
• LightTable
• html-pipeline
• postr
a. Read through some of the commit messages (by hovering over the commits) from the project you selected and briefly explain why this project has such a distinct pattern.
[text field]

B.2.7 Exploratory question

(pg. 1)
Choosing metrics
Commit messages provide current and future developers in the project with insight for what changes were made and the reasoning behind those changes. Unfortunately, developers sometimes neglect to detail the changes and reasons.
Using the Lines changed (incomparable btw.
projects) block length, find a single commit in vim.js where the developer made a significant change but neglected to elaborate on that change. Choose any commit that is NOT the first commit in the project. There can be multiple correct answers to this question.
14. Which metric(s) did you use?
[text field]
Which commit did you identify?
[text field]
[ground truth answer: any of 1557ac8, 09039d7, c343fb6, e071941, 238dba7, 7390e4e]
a. Write a short explanation for your choice of metric(s):
[text field]

B.2.8 Open comments

a. Was there anything confusing about RepoGrams? What tasks were difficult to perform?
[text field]
b. Other comments about RepoGrams and this study:
[text field]

B.2.9 Filtering results

We discarded individual answers when students spent a disproportionately short time (<10 seconds) on the page that contained those questions. We also received one extra entry (a total of 75 responses) in which the participant answered 3 of the 4 warmup questions wrong. We discarded this entry from our analysis and from all reports. Anonymized results are collected in the next section.

B.3 Raw results

Here we list the raw results. Cells with a dark red background denote an answer that is incompatible with our ground truth answers. Cells with an orange background denote that the answer for this question was discarded from the report for being answered in less than 10 seconds, and was most likely skipped by the participant. Columns titled gt mark whether the raw answer was compatible with our ground truth (true/false). Columns titled gt-agree and gt-disagree are the encoded labels that we assigned to the explanation by the participants, based on their raw text response.
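The filtering rule from Section B.2.9 and the gt columns described above can be illustrated with a minimal Python sketch. This is not the script used in the study: the `PageResponse` record, the function names, and the sample page times are hypothetical and serve only as an illustration. The 10-second threshold and the accepted hashes for question 10 are taken from the text.

```python
from dataclasses import dataclass

MIN_PAGE_SECONDS = 10  # threshold stated in Section B.2.9


def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string, as printed in the raw-results tables, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


@dataclass
class PageResponse:
    """Hypothetical record: one participant's answers on one questionnaire page."""
    answers: dict  # question id -> raw answer
    time: str      # time spent on the page, 'HH:MM:SS'


def keep_answers(page: PageResponse) -> dict:
    """Return the page's answers, or an empty dict if the page took under 10 seconds."""
    if to_seconds(page.time) < MIN_PAGE_SECONDS:
        return {}
    return dict(page.answers)


def mark_ground_truth(answers: dict, ground_truth: dict) -> dict:
    """Map each kept answer to True/False, mirroring the 'gt' columns."""
    return {q: a in ground_truth[q] for q, a in answers.items() if q in ground_truth}


# Question 10 accepts any of three commit hashes (Section B.2.5).
GROUND_TRUTH = {"10": {"522d848", "7729887", "c55f44e"}}

# Illustrative page times, not taken from the study data.
fast = PageResponse(answers={"10": "7729887"}, time="00:00:07")
slow = PageResponse(answers={"10": "7729887"}, time="00:01:35")

print(mark_ground_truth(keep_answers(fast), GROUND_TRUTH))  # {} (answer discarded)
print(mark_ground_truth(keep_answers(slow), GROUND_TRUTH))  # {'10': True}
```

The sketch treats a page time of exactly 10 seconds as kept, following the strict "<10 seconds" wording.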
We do not report raw text responses to preserve the anonymity of the participants.

Table B.1: Raw results from the demographics section in the user study with undergraduate students.
P   a      b   c   d        e    time
1   16+    0   4+  Weekly   yes  00:01:17
2   11-15  0   1   Monthly  no   00:00:53
3   11-15  3   3   Monthly  no   00:01:14
4   16+    3   2   Daily    yes  00:00:11
5   0-5    0   2   Weekly   no   00:02:09
6   16+    3   1   Never    yes  00:00:46
7   16+    0   2   Weekly   yes  00:00:46
8   6-10   4+  2   Weekly   yes  00:00:56
9   6-10   0   1   Daily    no   00:01:50
10  11-15  2   2   Daily    yes  00:02:15
11  11-15  0   1   Weekly   yes  00:01:43
12  6-10   2   1   Daily    yes  00:00:17
13  16+    0   1   Weekly   yes  00:00:38
14  6-10   0   3   Daily    yes  00:00:19
15  6-10   0   3   Daily    yes  00:01:36
16  0-5    0   2   Weekly   no   00:02:25
17  6-10   2   1   Weekly   yes  00:01:36
18  11-15  2   1   Daily    yes  00:01:07
19  0-5    3   3   Daily    yes  00:00:28
20  11-15  4+  4+  Weekly   yes  00:01:23
21  11-15  3   2   Monthly  yes  00:00:31
22  16+    0   2   Daily    yes  00:00:29
23  6-10   2   1   Monthly  yes  00:00:15
24  6-10   0   1   Weekly   yes  00:01:11
25  11-15  3   1   Weekly   yes  00:00:47
26  11-15  2   1   Daily    yes  00:01:02
27  11-15  0   2   Weekly   yes  00:00:55
28  11-15  0   2   Weekly   yes  00:02:25
29  6-10   2   2   Daily    yes  00:01:47
30  11-15  2   2   Monthly  yes  00:00:50
31  11-15  3   3   Weekly   yes  00:00:37
32  16+    0   2   Never    yes  00:00:53
33  0-5    1   2   Weekly   yes  00:01:11
34  11-15  1   1   Monthly  no   00:01:28
35  6-10   1   1   Monthly  yes  00:01:07
36  0-5    0   2   Weekly   yes  00:00:57
37  6-10   2   2   Daily    yes  00:00:30
38  11-15  4+  2   Weekly   yes  00:00:36
39  11-15  4+  2   Weekly   no   00:00:59
40  0-5    0   1   Daily    yes  00:00:44
41  11-15  3   2   Daily    yes  00:00:38
42  11-15  4+  4+  Weekly   yes  00:00:54
43  6-10   0   3   Monthly  yes  00:00:59
44  11-15  0   0   Never    no   00:00:41
45  11-15  4+  2   Weekly   yes  00:01:13
46  6-10   2   1   Weekly   no   00:01:47
47  11-15  0   2   Daily    yes  00:01:45
48  16+    4+  2   Weekly   yes  00:00:50
49  11-15  1   1   Weekly   yes  00:01:52
50  6-10   2   1   Weekly   yes  00:01:25
51  11-15  1   1   Never    yes  00:04:19
52  0-5    0   2   Monthly  no   00:01:42
53  16+    4+  3   Weekly   yes  00:02:07
54  6-10   1   2   Weekly   yes  00:01:15
55  11-15  2   1   Weekly   yes  00:01:31
56  11-15  2   2   Daily    yes  00:00:30
57  6-10   0   1   Weekly   no   00:01:29
58  16+    4+  3   Daily    yes  00:01:46
59  11-15  1   0   Never    no   00:02:10
60  6-10   2   1   Weekly   yes  00:01:10
61  6-10   2   1   Daily    yes  00:01:01
62  11-15  3   3   Daily    yes  00:00:50
63  6-10   0   1   Monthly  no   00:00:54
64  0-5    0   2   Daily    yes  00:01:36
65  16+    3   3   Weekly   yes  00:00:31
66  16+    4+  3   Daily    yes  00:01:01
67  6-10   0   1   Never    no   00:00:57
68  16+    4+  3   Monthly  yes  00:03:26
69  6-10   2   2   Weekly   yes  00:00:38
70  6-10   4+  1   Never    yes  00:03:56
71  16+    2   1   Daily    yes  00:00:35
72  6-10   4+  1   Monthly  no   00:01:12
73  16+    3   3   Weekly   yes  00:02:27
74  11-15  0   2   Weekly   yes  00:00:38

Table B.2: Raw results from the warmup section (questions 1–4) in the user study with undergraduate students.
Red cells denote participant answers that do not match our ground truth for that question.
P   1  2          time      3  4                time
1   5  Tens       00:01:55  4  Second smallest  00:00:07
2   5  Tens       00:04:20  4  Second smallest  00:03:43
3   5  Tens       00:02:44  4  Second smallest  00:02:11
4   5  Tens       00:01:37  4  Smallest         00:00:47
5   5  Tens       00:01:45  4  Smallest         00:01:06
6   5  Tens       00:02:45  4  Second smallest  00:02:05
7   5  Tens       00:02:09  4  Second smallest  00:02:05
8   5  Tens       00:01:56  4  Second smallest  00:01:35
9   5  Tens       00:02:29  4  Smallest         00:02:03
10  5  Tens       00:01:31  4  Second smallest  00:00:03
11  5  Tens       00:04:15  4  Smallest         00:01:13
12  5  Tens       00:01:09  4  Second smallest  00:01:47
13  5  Tens       00:01:50  4  Smallest         00:01:54
14  5  Tens       00:00:47  4  Largest          00:00:40
15  5  Tens       00:02:36  4  Second smallest  00:03:21
16  5  Tens       00:01:26  4  Smallest         00:00:44
17  5  Tens       00:02:13  4  Second smallest  00:01:44
18  5  Hundreds   00:02:38  4  Second smallest  00:02:31
19  5  Tens       00:01:12  4  Second largest   00:00:40
20  5  Tens       00:03:08  4  Second smallest  00:02:04
21  5  Hundreds   00:03:36  4  Second smallest  00:02:08
22  5  Tens       00:01:37  4  Second largest   00:05:15
23  5  Tens       00:00:03  4  Second smallest  00:00:10
24  5  Tens       00:02:35  4  Smallest         00:01:46
25  5  Tens       00:01:32  4  Smallest         00:01:44
26  5  Tens       00:02:17  4  Second smallest  00:01:54
27  5  Tens       00:01:27  4  Second smallest  00:01:31
28  5  Tens       00:02:45  4  Second smallest  00:01:47
29  5  Tens       00:01:56  4  Second smallest  00:01:21
30  5  Tens       00:01:10  4  Second smallest  00:01:45
31  5  Tens       00:02:07  4  Largest          00:01:17
32  5  Tens       00:04:02  4  Smallest         00:00:46
33  5  Tens       00:04:37  4  Second smallest  00:02:01
34  5  Tens       00:02:52  4  Second smallest  00:02:01
35  5  Hundreds   00:02:13  4  Second smallest  00:01:12
36  5  Tens       00:02:30  4  Smallest         00:05:17
37  5  Tens       00:06:34  4  Smallest         00:01:01
38  5  Hundreds   00:01:46  4  Second smallest  00:00:52
39  5  Hundreds   00:04:00  4  Smallest         00:03:02
40  5  Tens       00:01:31  4  Second largest   00:00:54
41  5  Tens       00:03:16  4  Second smallest  00:00:10
42  5  Tens       00:02:04  4  Smallest         00:02:31
43  5  Tens       00:02:26  4  Smallest         00:02:43
44  5  Tens       00:03:46  4  Second smallest  00:01:50
45  5  Tens       00:01:57  4  Second smallest  00:03:01
46  5  Thousands  00:06:38  4  Second largest   00:02:17
47  5  Tens       00:01:04  4  Smallest         00:00:54
48  5  Hundreds   00:04:01  4  Second smallest  00:02:11
49  5  Tens       00:01:10  4  Second smallest  00:01:27
50  5  Tens       00:02:13  4  Second smallest  00:03:51
51  5  Tens       00:02:33  4  Smallest         00:01:32
52  5  Tens       00:01:30  4  Smallest         00:01:22
53  5  Tens       00:04:09  4  Second smallest  00:01:47
54  5  Tens       00:07:07  4  Second smallest  00:00:08
55  5  Tens       00:02:48  4  Second smallest  00:00:55
56  5  Tens       00:00:58  4  Second smallest  00:00:45
57  5  Tens       00:02:13  4  Second smallest  00:02:13
58  5  Tens       00:01:56  4  Second smallest  00:00:49
59  5  Tens       00:05:41  4  Second smallest  00:01:28
60  5  Tens       00:00:53  4  Second smallest  00:02:09
61  5  Tens       00:01:40  4  Second smallest  00:03:12
62  5  Tens       00:02:46  4  Second smallest  00:01:39
63  5  Tens       00:02:46  4  Second smallest  00:01:13
64  5  Tens       00:04:21  4  Second smallest  00:01:48
65  5  Tens       00:02:15  4  Second smallest  00:03:09
66  5  Tens       00:01:21  4  Second smallest  00:01:04
67  5  Thousands  00:01:32  4  Largest          00:00:56
68  5  Tens       00:01:46  4  Smallest         00:03:06
69  5  Tens       00:00:19  4  Second smallest  00:01:12
70  5  Tens       00:03:57  4  Smallest         00:01:32
71  5  Tens       00:00:06  4  Second smallest  00:01:13
72  5  Tens       00:02:20  4  Second smallest  00:00:51
73  5  Tens       00:03:06  4  Second smallest  00:00:10
74  5  Hundreds   00:03:16  4  Second smallest  00:01:32

Table B.3: Raw results from
the metrics comprehension section (questions 5–7) in the user study with undergraduate students.
Red cells denote participant answers that do not match our ground truth for that question.
Orange cells denote participant answers that took less than 10 seconds, and were ignored.
P   5              time      6              7              time
1   vim.js         00:00:38  vim.js         html-pipeline  00:01:39
2   vim.js         00:02:44  vim.js         html-pipeline  00:01:21
3   vim.js         00:00:56  vim.js         html-pipeline  00:01:14
4   vim.js         00:00:16  vim.js         html-pipeline  00:00:55
5   LightTable     00:00:57  vim.js         html-pipeline  00:01:22
6   vim.js         00:01:29  vim.js         html-pipeline  00:01:26
7   vim.js         00:01:18  sqlitebrowser  html-pipeline  00:02:38
8   vim.js         00:01:22  vim.js         html-pipeline  00:02:04
9   vim.js         00:01:19  vim.js         html-pipeline  00:00:47
10  skipped                  vim.js         html-pipeline  00:03:51
11  skipped                  vim.js         html-pipeline  00:01:14
12  LightTable     00:01:38  vim.js         html-pipeline  00:01:23
13  LightTable     00:00:30  vim.js         LightTable     00:01:05
14  vim.js         00:00:53  vim.js         html-pipeline  00:00:42
15  vim.js         00:01:39  vim.js         html-pipeline  00:00:11
16  vim.js         00:00:37  vim.js         html-pipeline  00:02:13
17  vim.js         00:01:35  vim.js         html-pipeline  00:01:15
18  vim.js         00:04:35  sqlitebrowser  html-pipeline  00:01:53
19  vim.js         00:00:55  vim.js         html-pipeline  00:00:42
20  vim.js         00:00:53  vim.js         html-pipeline  00:01:53
21  vim.js         00:01:38  postr          LightTable     00:00:43
22  vim.js         00:01:44  vim.js         html-pipeline  00:01:05
23  skipped                  vim.js         html-pipeline  00:00:39
24  vim.js         00:01:32  vim.js         html-pipeline  00:01:45
25  LightTable     00:01:29  sqlitebrowser  LightTable     00:01:44
26  vim.js         00:00:40  LightTable     vim.js         00:01:20
27  LightTable     00:01:27  vim.js         html-pipeline  00:00:42
28  vim.js         00:00:40  vim.js         html-pipeline  00:01:35
29  vim.js         00:00:43  vim.js         html-pipeline  00:01:20
30  vim.js         00:00:37  html-pipeline  vim.js         00:00:31
31  LightTable     00:02:24  vim.js         html-pipeline  00:00:59
32  sqlitebrowser  00:01:22  vim.js         html-pipeline  00:01:58
33  vim.js         00:07:44  vim.js         html-pipeline  00:01:20
34  vim.js         00:00:54  vim.js         html-pipeline  00:01:52
35  vim.js         00:00:29  vim.js         html-pipeline  00:01:33
36  vim.js         00:01:05  vim.js         html-pipeline  00:02:35
37  LightTable     00:00:36  vim.js         html-pipeline  00:01:46
38  LightTable     00:00:20  vim.js         html-pipeline  00:00:42
39  vim.js         00:01:30  vim.js         html-pipeline  00:01:22
40  postr          00:00:35  LightTable     html-pipeline  00:02:45
41  vim.js         00:00:37  LightTable     html-pipeline  00:01:23
42  LightTable     00:01:41  vim.js         html-pipeline  00:01:47
43  vim.js         00:02:34  sqlitebrowser  html-pipeline  00:03:10
44  vim.js         00:00:58  vim.js         html-pipeline  00:01:35
45  LightTable     00:01:38  vim.js         html-pipeline  00:07:43
46  skipped                  vim.js         html-pipeline  00:02:34
47  vim.js         00:01:21  vim.js         html-pipeline  00:00:56
48  vim.js         00:01:25  vim.js         html-pipeline  00:01:09
49  vim.js         00:01:11  sqlitebrowser  html-pipeline  00:00:50
50  vim.js         00:01:21  vim.js         html-pipeline  00:02:39
51  skipped                  vim.js         html-pipeline  00:01:00
52  LightTable     00:01:02  html-pipeline  vim.js         00:01:52
53  LightTable     00:03:35  vim.js         html-pipeline  00:01:12
54  skipped                  skipped        skipped
55  vim.js         00:00:50  html-pipeline  LightTable     00:00:52
56  vim.js         00:00:31  vim.js         postr          00:01:21
57  skipped                  skipped        skipped
58  vim.js         00:00:45  postr          sqlitebrowser  00:01:19
59  vim.js         00:00:55  vim.js         html-pipeline  00:01:08
60  sqlitebrowser  00:00:52  vim.js         html-pipeline  00:01:00
61  vim.js         00:00:53  vim.js         html-pipeline  00:01:56
62  vim.js         00:00:57  vim.js         html-pipeline  00:01:19
63  vim.js         00:01:39  skipped        skipped
64  vim.js         00:01:16  vim.js         html-pipeline  00:01:21
65  LightTable     00:01:36  vim.js         html-pipeline  00:01:58
66  LightTable     00:00:41  vim.js         postr          00:01:04
67  vim.js         00:01:29  vim.js         html-pipeline  00:01:00
68  sqlitebrowser  00:01:09  vim.js         html-pipeline  00:01:06
69  vim.js         00:00:55  vim.js         html-pipeline  00:00:49
70  LightTable     00:01:33  vim.js         html-pipeline  00:01:15
71  vim.js         00:00:51  vim.js         html-pipeline  00:01:54
72  vim.js         00:00:35  sqlitebrowser  html-pipeline  00:00:51
73  vim.js         00:00:15  vim.js         html-pipeline  00:01:12
74  vim.js         00:00:38  LightTable     vim.js         00:01:04

Table B.4: Raw results from the metrics comprehension section (questions 8–10) in the user study with undergraduate students.
Red cells
denote participant answers that do not match our ground truth for that question.Orange cells denote participant answers that took less than 10 seconds, and were ignored.P 8 time 9 time 10 time1 bed325 TRUE 00:01:04 sqlitebrowser 00:01:15 c55f44e TRUE 00:01:042 bed325 TRUE 00:03:01 sqlitebrowser 00:03:42 c55f44e TRUE 00:01:443 bed325 TRUE 00:01:36 sqlitebrowser 00:00:52 c55f44e TRUE 00:01:034 bed325 TRUE 00:00:44 sqlitebrowser 00:00:33 80ac53 FALSE 00:00:345 bed325 TRUE 00:02:13 sqlitebrowser 00:01:52 c55f44e TRUE 00:01:196 bed325 TRUE 00:02:20 sqlitebrowser 00:01:30 522d84 TRUE 00:01:017 bed325 TRUE 00:06:15 sqlitebrowser 00:02:16 c55f44e TRUE 00:00:578 77b4d FALSE 00:01:08 sqlitebrowser 00:00:53 b1560 FALSE 00:00:419 2f1b685 FALSE 00:01:27 sqlitebrowser 00:00:32 4dc0bb FALSE 00:00:4710 bed325 TRUE 00:01:14 postr 00:01:10 c55f44e TRUE 00:01:2611 bed325 TRUE 00:01:23 sqlitebrowser 00:01:17 f12727f FALSE 00:00:4312 bed325 TRUE 00:00:54 sqlitebrowser 00:00:49 c55f44e TRUE 00:00:5613 bed325 TRUE 00:00:43 sqlitebrowser 00:00:55 7e8e21 FALSE 00:00:5314 b8d043 FALSE 00:00:26 LightTable 00:02:02 3dad78 FALSE 00:00:3115 bed32 TRUE 00:01:58 sqlitebrowser 00:01:52 c55f4 TRUE 00:01:2016 bed325 TRUE 00:02:16 sqlitebrowser 00:01:37 772988 TRUE 00:00:3117 bed325 TRUE 00:01:23 sqlitebrowser 00:01:13 c55f44e TRUE 00:00:4918 68ab71 FALSE 00:00:51 sqlitebrowser 00:00:43 d22605 FALSE 00:00:4019 bed325 TRUE 00:00:51 sqlitebrowser 00:01:01 772988 TRUE 00:00:4620 bed325 TRUE 00:01:50 vim.js 00:00:38 522d84 TRUE 00:00:3921 e2f651ff FALSE 00:01:37 sqlitebrowser 00:00:46 c55f44e TRUE 00:02:2922 b8d043 FALSE 00:01:01 sqlitebrowser 00:01:08 21e3dd FALSE 00:00:5223 d3566 FALSE 00:00:24 sqlitebrowser 00:00:25 77298 TRUE 00:00:4224 sqlitebrowser 00:02:10 772988 TRUE 00:01:0025 bed325 TRUE 00:01:00 sqlitebrowser 00:01:23 c55f44e TRUE 00:01:2426 944e02 FALSE 00:01:45 html-pipeline 00:01:41 c55f44e TRUE 00:00:4927 bed325 TRUE 00:01:00 sqlitebrowser 00:01:38 c55f44e TRUE 00:00:4028 bed325 TRUE 
00:01:36 sqlitebrowser 00:01:54 c55f44e TRUE 00:02:2429 77b4d5 FALSE 00:00:53 sqlitebrowser 00:00:53 4dc0bb FALSE 00:00:4730 bed325 TRUE 00:01:47 sqlitebrowser 00:01:48 c55f44e TRUE 00:00:5431 284a48 FALSE 00:01:15 sqlitebrowser 00:00:30 4dc0bb FALSE 00:01:1432 bed325 TRUE 00:01:30 postr 00:00:37 023441 FALSE 00:01:1033 ca78b0 FALSE 00:01:15 sqlitebrowser 00:02:55 522d84 TRUE 00:02:1834 f8a1769 FALSE 00:02:19 sqlitebrowser 00:01:47 c55f44e TRUE 00:00:5835 bed325 TRUE 00:01:36 sqlitebrowser 00:00:51 c55f44e TRUE 00:00:5436 bed325 TRUE 00:01:41 sqlitebrowser 00:03:19 c55f44e TRUE 00:01:4437 f829f06 FALSE 00:01:24 sqlitebrowser 00:01:10 023441 FALSE 00:01:0338 944e02 FALSE 00:00:34 sqlitebrowser 00:00:44 772988 TRUE 00:00:3939 bed325 TRUE 00:01:20 sqlitebrowser 00:01:26 c55f44e TRUE 00:01:2040 b8d043 FALSE 00:01:41 postr 00:00:53 522d84 TRUE 00:01:1641 bed325 TRUE 00:00:42 sqlitebrowser 00:00:51 772988 TRUE 00:00:3791P 8 time 9 time 10 time42 bed325 TRUE 00:01:24 sqlitebrowser 00:02:0343 bed325 TRUE 00:01:21 sqlitebrowser 00:01:43 772988 TRUE 00:01:1344 bc159 FALSE 00:02:15 sqlitebrowser 00:01:15 4dc0b FALSE 00:01:4745 944e02 FALSE 00:04:23 sqlitebrowser 00:05:12 772988 TRUE 00:02:1346 944e02 FALSE 00:02:39 sqlitebrowser 00:01:0147 bed325 TRUE 00:01:37 sqlitebrowser 00:01:14 772988 TRUE 00:00:3248 bed325 TRUE 00:01:30 sqlitebrowser 00:03:00 c55f44e TRUE 00:00:5149 944e02 FALSE 00:00:46 sqlitebrowser 00:00:38 772988 TRUE 00:00:4250 bed32 TRUE 00:06:00 sqlitebrowser 00:00:57 c55f4 TRUE 00:00:5851 bed327 TRUE 00:01:53 sqlitebrowser 00:00:43 522d84 TRUE 00:01:3752 d957e4 FALSE 00:01:19 html-pipeline 00:00:59 772988 TRUE 00:01:0753 bed325 TRUE 00:02:23 sqlitebrowser 00:00:36 c55f44e TRUE 00:00:3754 00:00:51 sqlitebrowser 00:01:09 522d84 TRUE 00:00:4955 77b4d5 FALSE 00:01:12 sqlitebrowser 00:00:40 c55f44e TRUE 00:01:2356 c1f9dae FALSE 00:00:58 html-pipeline 00:00:39 d22605 FALSE 00:00:4357 bed325 TRUE 00:01:22 sqlitebrowser 00:01:41 c55f44e TRUE 00:01:0858 944e02 FALSE 
00:00:52 sqlitebrowser 00:01:37 c55f44e TRUE 00:00:3959 bed32 TRUE 00:01:24 sqlitebrowser 00:01:29 c55f4 TRUE 00:04:1960 bed325 TRUE 00:01:23 sqlitebrowser 00:02:32 c55f44e TRUE 00:01:0761 944e02 FALSE 00:01:21 sqlitebrowser 00:02:08 772988 TRUE 00:00:4962 bed325 TRUE 00:01:11 sqlitebrowser 00:01:44 c55f44e TRUE 00:00:4963 bed325 TRUE 00:01:12 sqlitebrowser 00:00:34 c55f44e TRUE 00:00:5164 bed325 TRUE 00:01:32 sqlitebrowser 00:01:27 c55f44e TRUE 00:01:1765 bed325 TRUE 00:01:47 sqlitebrowser 00:02:44 c55f44e TRUE 00:00:5366 77b4d5 FALSE 00:01:03 sqlitebrowser 00:00:39 b15607 FALSE 00:00:4467 428940 FALSE 00:01:38 sqlitebrowser 00:00:50 c55f44e TRUE 00:01:1668 944e02 FALSE 00:01:11 sqlitebrowser 00:01:17 c55f44e TRUE 00:00:4369 a4be10 FALSE 00:00:36 sqlitebrowser 00:00:29 ec1f3a2 FALSE 00:00:3970 73ef810 FALSE 00:02:19 vim.js 00:01:41 c55f44e TRUE 00:00:5171 bed325 TRUE 00:02:57 sqlitebrowser 00:01:07 522d84 TRUE 00:01:1072 944e02 FALSE 00:01:02 sqlitebrowser 00:01:16 c55f44e TRUE 00:00:4473 bed325 TRUE 00:01:55 sqlitebrowser 00:01:46 c55f44e TRUE 00:01:5774 944e02 FALSE 00:00:43 sqlitebrowser 00:01:12 9c283d FALSE 00:00:2192Table B.5: Raw results from the project comparison section (questions 11–13) in the user study with undergraduate students.Red cells denote participant answers that do not match our ground truth for that question.Orange cells denote participant answers that took less than 10 seconds, and were ignored.P 11 gt-agree gt-disagree time 12 12 html-pip gt-agree gt-disagree time 13 gt-agree gt-disagree time1 html-pipeline BRA 00:02:31 html-pipeline postr TRUE LAN 00:01:48 vim.js AUT 00:01:322 html-pipeline BRA 00:02:18 html-pipeline postr TRUE LAN 00:02:53 postr OTH 00:03:533 html-pipeline BRA 00:01:38 postr html-pipeline TRUE LAN 00:02:13 vim.js LEN 00:02:344 html-pipeline VIS 00:00:43 html-pipeline postr TRUE VIS 00:00:45 vim.js AUT 00:00:555 html-pipeline BRA 00:01:29 html-pipeline postr TRUE LAN 00:01:42 vim.js AUT 00:02:116 html-pipeline VIS 
00:01:46 postr html-pipeline TRUE VIS 00:03:30 vim.js AUT 00:02:547 html-pipeline BRA,VIS 00:03:19 html-pipeline postr TRUE LAN,VIS 00:01:41 html-pipeline LEN 00:07:278 html-pipeline BRA 00:03:00 html-pipeline postr TRUE LAN 00:00:54 vim.js AUT 00:01:189 html-pipeline BRA 00:01:54 html-pipeline postr TRUE LAN 00:00:51 vim.js AUT 00:00:3810 sqlitebrowser BRA 00:03:11 postr html-pipeline TRUE LAN 00:02:03 vim.js AUT 00:01:4611 html-pipeline 00:01:02 html-pipeline postr TRUE OTH 00:00:48 LightTable 00:00:1212 html-pipeline BRA 00:02:00 vim.js LightTable FALSE LAN 00:01:57 vim.js AUT 00:01:0413 sqlitebrowser BRA 00:02:07 vim.js LightTable FALSE OTH 00:01:06 vim.js LEN 00:01:2514 html-pipeline VIS 00:00:33 vim.js LightTable FALSE VIS 00:00:24 sqlitebrowser 00:00:5315 html-pipeline BRA 00:03:28 html-pipeline postr TRUE VIS 00:03:14 vim.js LEN 00:04:5716 html-pipeline LIN 00:01:33 html-pipeline postr TRUE LAN 00:01:41 sqlitebrowser OTH 00:02:5917 html-pipeline BRA 00:02:46 html-pipeline postr TRUE LAN 00:02:25 vim.js AUT 00:01:3918 html-pipeline BRA 00:02:13 html-pipeline postr TRUE LAN 00:01:06 vim.js VIS 00:00:4619 html-pipeline LIN 00:01:07 html-pipeline postr TRUE LAN 00:01:01 sqlitebrowser OTH 00:01:1420 sqlitebrowser VIS 00:00:58 html-pipeline postr TRUE VIS 00:00:51 vim.js VIS 00:00:5821 html-pipeline BRA 00:01:42 postr html-pipeline TRUE OTH 00:01:18 sqlitebrowser 00:01:2222 html-pipeline BRA 00:01:30 html-pipeline postr TRUE LAN 00:01:08 LightTable LEN 00:01:5323 html-pipeline BRA 00:01:43 html-pipeline postr TRUE LAN 00:01:33 vim.js AUT 00:02:0424 html-pipeline BRA 00:01:53 html-pipeline postr TRUE LAN 00:01:48 vim.js AUT 00:03:4725 html-pipeline VIS 00:02:08 html-pipeline postr TRUE VIS 00:01:03 vim.js AUT 00:01:2826 html-pipeline BRA 00:01:38 html-pipeline postr TRUE LAN 00:01:44 vim.js AUT 00:02:2027 html-pipeline BRA 00:00:57 html-pipeline postr TRUE LAN 00:01:02 vim.js AUT 00:02:2128 html-pipeline VIS 00:01:09 html-pipeline postr TRUE VIS 00:01:15 
sqlitebrowser VIS 00:01:3529 html-pipeline BRA 00:01:05 html-pipeline postr TRUE LAN 00:01:12 vim.js LEN 00:01:0330 html-pipeline BRA 00:01:44 html-pipeline postr TRUE LAN 00:01:06 vim.js AUT 00:01:2331 sqlitebrowser VIS 00:00:49 sqlitebrowser LightTable FALSE VIS 00:00:56 vim.js AUT 00:00:4532 html-pipeline VIS 00:01:22 html-pipeline postr TRUE OTH 00:01:45 vim.js AUT 00:01:2333 html-pipeline VIS 00:00:53 html-pipeline postr TRUE VIS 00:01:12 vim.js LEN 00:01:4034 html-pipeline VIS 00:01:23 html-pipeline postr TRUE LAN 00:01:28 vim.js VIS 00:03:5635 html-pipeline VIS 00:01:10 LightTable sqlitebrowser FALSE VIS 00:02:01 vim.js AUT 00:02:3036 html-pipeline BRA,VIS 00:02:20 postr html-pipeline TRUE LAN,VIS 00:02:08 vim.js AUT 00:01:2337 html-pipeline VIS 00:01:40 html-pipeline postr TRUE VIS 00:01:51 vim.js AUT 00:01:2838 html-pipeline LIN 00:01:28 postr html-pipeline TRUE OTH 00:00:56 vim.js AUT 00:01:0739 html-pipeline VIS 00:01:07 LightTable vim.js FALSE VIS 00:02:54 sqlitebrowser OTH 00:03:5940 sqlitebrowser VIS 00:01:13 sqlitebrowser html-pipeline FALSE VIS 00:01:14 vim.js AUT 00:01:3241 html-pipeline LIN 00:01:54 html-pipeline postr TRUE LAN,VIS 00:00:56 vim.js AUT 00:01:2242 html-pipeline BRA 00:06:33 sqlitebrowser LightTable FALSE LAN 00:02:16 vim.js AUT 00:01:1943 html-pipeline VIS 00:01:06 html-pipeline postr TRUE LAN 00:03:13 vim.js AUT 00:02:4144 html-pipeline BRA 00:01:42 html-pipeline postr TRUE VIS 00:02:23 vim.js AUT 00:02:4145 skipped html-pipeline postr TRUE LAN 00:00:19 sqlitebrowser LEN 00:02:4746 vim.js 00:03:43 sqlitebrowser html-pipeline FALSE OTH 00:01:32 LightTable 00:00:5347 html-pipeline BRA 00:01:22 LightTable vim.js FALSE LAN 00:01:39 vim.js AUT 00:01:4248 html-pipeline BRA 00:01:54 html-pipeline postr TRUE LAN 00:01:25 vim.js AUT 00:02:2249 html-pipeline VIS 00:01:08 html-pipeline postr TRUE VIS 00:01:11 vim.js AUT 00:01:2250 html-pipeline BRA,VIS 00:03:44 html-pipeline postr TRUE LAN 00:02:01 vim.js LEN 00:01:3051 html-pipeline VIS 
00:00:44 html-pipeline postr TRUE VIS 00:00:39 skipped AUT52 html-pipeline VIS 00:01:13 html-pipeline postr TRUE VIS 00:00:57 vim.js AUT 00:01:1253 html-pipeline VIS 00:01:45 html-pipeline postr TRUE LAN 00:01:12 vim.js AUT 00:01:0554 html-pipeline VIS 00:00:50 html-pipeline postr TRUE VIS 00:00:49 vim.js AUT 00:01:0455 html-pipeline 00:01:03 html-pipeline postr TRUE OTH 00:00:30 vim.js LEN 00:01:0956 vim.js BRA 00:01:15 html-pipeline postr TRUE LAN,VIS 00:01:08 vim.js AUT 00:01:0057 html-pipeline BRA 00:02:02 html-pipeline postr TRUE LAN 00:02:21 vim.js AUT 00:01:2658 html-pipeline BRA,VIS 00:01:06 html-pipeline postr TRUE VIS 00:01:25 vim.js AUT 00:01:4359 html-pipeline VIS 00:02:36 postr html-pipeline TRUE VIS 00:00:54 vim.js VIS 00:02:2760 html-pipeline BRA 00:01:37 html-pipeline postr TRUE LAN 00:03:13 vim.js AUT 00:02:1161 html-pipeline BRA,VIS 00:02:25 html-pipeline postr TRUE LAN 00:01:35 vim.js LEN 00:02:2362 html-pipeline BRA,VIS 00:01:08 html-pipeline postr TRUE LAN 00:01:49 vim.js AUT 00:01:4963 html-pipeline BRA 00:03:24 html-pipeline postr TRUE LAN 00:02:01 sqlitebrowser OTH 00:03:1864 html-pipeline 00:02:41 html-pipeline postr TRUE LAN 00:02:27 skipped65 html-pipeline VIS 00:00:47 html-pipeline postr TRUE LAN,VIS 00:01:09 vim.js AUT 00:01:1866 html-pipeline VIS 00:01:22 vim.js LightTable FALSE VIS 00:00:57 vim.js LEN 00:01:0067 html-pipeline BRA 00:02:00 LightTable vim.js FALSE LAN 00:02:00 sqlitebrowser OTH 00:02:2268 vim.js BRA,VIS 00:02:04 html-pipeline postr TRUE LAN 00:01:35 vim.js AUT 00:01:4669 html-pipeline BRA 00:00:51 html-pipeline postr TRUE LAN 00:01:31 vim.js AUT 00:02:2770 sqlitebrowser VIS 00:00:52 html-pipeline postr TRUE VIS 00:03:26 vim.js VIS 00:01:4993P 11 gt-agree gt-disagree time 12 12 html-pip gt-agree gt-disagree time 13 gt-agree gt-disagree time71 html-pipeline BRA 00:01:14 vim.js LightTable FALSE LAN 00:01:07 vim.js LEN 00:00:5372 html-pipeline BRA 00:01:24 html-pipeline postr TRUE LAN 00:01:39 vim.js AUT 00:01:2173 
html-pipeline BRA 00:01:29 postr html-pipeline TRUE LAN 00:01:30 vim.js AUT 00:01:4674 vim.js VIS 00:01:43 sqlitebrowser LightTable FALSE VIS 00:02:46 vim.js VIS 00:01:4194Table B.6: Raw results from the exploratory question in the user study withundergraduate students.Red cells denote participant answers that do not match our ground truth for that question.Orange cells denote participant answers that took less than 10 seconds, and were ignored.P 14-explain (raw text) Metric 1 Metric 2* Attrib 14 gt gt-agree gt-disagree time1 Commit message length Commit Message Lengt ~ Block Length 580455 FALSE LIN,MSG 00:01:562 Commit Message Length Commit Message Lengt ? ? 7eddc5 FALSE MSG 00:03:013 Commit Localization Commit Localization ? ? 238dba TRUE MET 00:03:594 Most edited file Most Edited File ~ Commit Message ~5 Most edited files Most Edited File ? ? 09039d TRUE MET 00:04:146 Message length Commit Message Lengt ? ? be4b34 FALSE LIN,MSG 00:03:147 commit message length Commit Message Lengt ~ Block Length e07194 TRUE LIN,MSG 00:03:578 Commit Message Length Commit Message Lengt ? ? a430a FALSE MSG 00:01:439 Most edited file Most Edited File ~ Block Length 09039d TRUE LIN 00:02:1510 Commit message length Commit Message Lengt ~ Block Length 1557ac TRUE LIN,MET 00:03:2011 FALSE 00:00:1312 Commit message length Commit Message Lengt ~ Block Length 7390e4 TRUE LIN,MSG 00:04:57131415 Commit message length, Mos Commit Message Lengt ~ Commit Message 1557a TRUE MET,MSG 00:04:0216 Commit Message Length Commit Message Lengt ? ? 1557ac TRUE MSG 00:03:4417 commit message length Commit Message Lengt ~ Block Length 1557ac TRUE LIN,MSG 00:05:0518 most edited file Most Edited File ? ? eef9a3f FALSE MET 00:02:4019 Message Length, Lines chang Commit Message Lengt ~ Block Length c343fb TRUE LIN,MSG 00:03:2920 Commit message length Commit Message Lengt ? ? 522d84 FALSE MSG 00:02:2921 FALSE 00:00:1022 commit message length Commit Message Lengt ? ? 
580455 FALSE MSG 00:01:3023 Commit Message Length Commit Message Lengt ~ Block Length 8f5b0 FALSE LIN,MSG 00:02:4424 Languages in a commit Languages in a Commit ~ Commit Message 860100 FALSE MET,MSG 00:04:1525 Commit message length Commit Message Lengt ~ Block Length 1557ac TRUE LIN,MSG 00:02:1326 Commit Localization Commit Localization ? ? 5e1492 FALSE MET 00:03:5627 Commit Message Length Commit Message Lengt ~ Block Length c343fb TRUE LIN,MSG 00:02:3328 Commit message length Commit Message Lengt ~ Block Length e07194 TRUE LIN,MSG 00:04:1929 commit message length Commit Message Lengt ~ Block Length 7390e4 TRUE LIN,MSG 00:02:2530 Commit Message Length Commit Message Lengt ? ~ 7390e4 TRUE LIN,MSG 00:02:3131 Commit Message Length Commit Message Lengt ? ? e07194 TRUE MSG 00:01:3832 Commit Message Length Commit Message Lengt ~ Block Length 9c283d FALSE LIN 00:02:3233 most edited file Most Edited File ~ Commit Message 09039d TRUE LIN,MSG 00:03:4034 commit localization Commit Message Lengt Commit Localization ~ 8f5b0e FALSE LIN,MET 00:02:2935 Commit Message Length Commit Message Lengt ~ Block Length 9c283d FALSE LIN,MSG 00:02:2636 Localization Commit Localization ~ Commit Message 238dba TRUE MET,MSG 00:04:4737 Languages in a commit Languages in a Commit ~ Commit Message 87eff91 FALSE MET,MSG 00:03:1138 Most edited file Most Edited File ? ? 09039d TRUE MET 00:02:2139 Commit Localization Commit Localization ~ Commit Message 9c283d FALSE MSG 00:04:0640 Commit Localization Commit Localization ? ? 1557ac TRUE MET 00:02:5441 Commit Localization Commit Localization ~ Commit Message 58ab0d FALSE MET,MSG 00:05:4942 The biggest whitest block Commit Message Lengt ~ Block Length 7390e4 TRUE LIN,MSG 00:02:5243 Commit Message Length Commit Message Lengt ? ? e07194 TRUE 00:03:0244 commit message length Commit Message Lengt ? ? 7390e TRUE MSG 00:02:3545 Commit message length Commit Message Lengt ? ? c343fb TRUE MSG 00:04:2446 block length ? ? 
Block Length 7390e4 TRUE 00:04:3447 Commit message length and l Commit Message Lengt ~ Block Length 09039d TRUE LIN,MSG 00:02:2348 Commit Localization Commit Localization ? ? 3748e4 FALSE MET 00:03:2249 most edited file Most Edited File ~ Commit Message b8d043 FALSE MET,MSG 00:03:1350 Commit Message Length Commit Message Lengt ~ Block Length 7390e TRUE LIN,MSG 00:02:3651 color ? ? ? 4ab142 FALSE MSG 00:03:1152 Commit Message Length Commit Message Lengt ? ? c343fb TRUE MSG 00:02:1895P 14-explain (raw text) Metric 1 Metric 2* Attrib 14 gt gt-agree gt-disagree time53 Commit Message Length Commit Message Lengt ? ? e07194 TRUE MSG 00:02:1554 fix buils ? ~ Block Length fe9105 FALSE MET,MSG 00:03:5755 most edited file Most Edited File ? ? 09039d TRUE 00:01:5056 commit message length Commit Message Lengt ? ? c343fb TRUE MSG 00:02:2157 Commit message length Commit Message Lengt ~ Block Length c343fb TRUE LIN,MSG 00:03:2958 Length of block and color Commit Message Lengt ~ Block Length 7390e4 TRUE LIN,MSG 00:02:4059 Commit Message Length Commit Message Lengt ~ Block Length60 Commit Localization, Commit Commit Message Lengt Commit Localization ~ 238dba TRUE MET,MSG 00:09:5661 commit message length Commit Message Lengt ? ? c343fb TRUE MSG 00:04:1962 Most Edited File Most Edited File ? ? eef9a3f FALSE MET 00:02:2163 Most files edited Most Edited File FALSE 00:02:4964 00:02:4565 Commit Metric Length Commit Message Lengt ~ Block Length c343fb TRUE LIN,MSG 00:03:3466 Block Length ? ? Block Length 1557ac TRUE LIN 00:01:0867 Message Length Commit Message Lengt ~ Block Length 238dba TRUE LIN,MSG 00:02:4168 commit message length Commit Message Lengt ? ? 580455 FALSE MSG 00:02:0169 commit localization Commit Localization ~ Commit Message 3748e4 FALSE MET,MSG 00:02:2070 color ? ? ? ce8a4f FALSE MET 00:03:3371 Most Edited File Most Edited File ? ? 85985d FALSE 00:02:2572 Commit Localization Commit Localization ~ Commit Message c55f44 FALSE MET,MSG 00:10:2273 meesage length ? 
~ Message Length 7390e4 TRUE MSG 00:03:32
74 commit message length Commit Message Lengt ? ? 1557ac TRUE MSG 00:01:54

• VIS The explanation discusses the visual characteristics of the repository footprints, e.g., “Similar visualization”, “Similar pattern”, “Similar colors”
• BRA The explanation discusses the branches, based on the values presented by the branches used or number of branches metrics, e.g., “Increased number of branches over time”, “Similar number of branches”
• LIN The explanation discusses the number of LoC changed, based on the length of commit blocks when the appropriate block length mode was selected, e.g., “High number of lines changed towards the end”, “Larger commits towards the end”
• LAN The explanation discusses the number of distinct programming languages involved in a commit, based on the languages in a commit metric, e.g., “Both projects use few languages”
• AUT The explanation discusses the scripted/automated commits performed on the vim.js project
• LEN The explanation discusses the difference in commit message length, based on either the commit message length metric or based on reading the actual commit message on the commit block
• MSG The explanation discusses the contents of the commit message of the commit that was selected by the participant in the exploratory question
• MET The explanation discusses the metric values of the metric that was selected by the participant in the exploratory question
• OTH Other types of explanations or unclear explanations

Appendix C
Software engineering researchers study
This appendix contains meta-data and raw results for the user study with SE researchers described in Section 5.2.

C.1 Protocol and questionnaire
C.1.1 Procedure overview
The focus of the study is to show that participants can use RepoGrams for pattern identification.
We want to focus the study on finding clusters/categories in the repositories.
The study will be conducted individually with each participant, either in a meeting room or over the web with video conferencing software.

C.1.2 Study protocol
We begin the description of each question with the following data:
[metric1, metric2, ..., metricN ; metrics-grouping ; block-length ; repo-url1, repo-url2, ..., repo-urlN]
This lists the settings that RepoGrams will be set to prior to each question: which metric(s) will be used, which grouping mode, which block length mode, and which repository URLs.
For the preparation and main tasks we will notify the participants one minute before they run out of time. When a participant runs out of time we will ask them to mark “skipped” on the question and move on to the next question. At any point we will give the participants the option of skipping a task that they cannot complete. We will use the pilot studies to estimate how long the tasks should take.
The participants will answer the tasks using custom online questionnaire software.

C.1.3 Questionnaire
Demographics
a. Gender
• Male
• Female
• Other/rather not say
b. How often do you use Distributed Version Control Systems?
• Never/Rarely
• Once a month
• Once a week
• Daily
c. What is your academic status?
• Undergraduate student
• Masters student
• PhD student
• Postdoc
• Faculty
d. [multiple choice] As part of your research, have you performed any of the following?
• Studied a version controlled project repository
• Inspected the evolution of a software project
• Evaluated a tool using artifacts (e.g., source code, logs, bug tracking issues) from one or more software projects

Introduction to RepoGrams
We will now introduce RepoGrams and demonstrate the tool for you.
At the beginning of the study we will introduce and demonstrate the tool to the participant, taking approximately 15 minutes.
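As an aside, the bracketed settings notation defined in Section C.1.2 is regular enough to be read mechanically. The Python sketch below splits one such line into its four fields. `ViewSettings` and `parse_settings` are hypothetical names introduced here for illustration and are not part of RepoGrams; the sketch also assumes that no metric name or repository URL contains `;` or `,`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ViewSettings:
    """Hypothetical container for one settings line from the protocol."""
    metrics: List[str]      # metric(s) shown for the question
    grouping: str           # metrics-grouping mode, e.g. "metrics-first"
    block_length: str       # block length mode, e.g. "fixed-length"
    repo_urls: List[str]    # repositories loaded for the question


def parse_settings(text: str) -> ViewSettings:
    """Parse '[metrics ; grouping ; block-length ; repo-urls]' into its fields."""
    body = text.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 ';'-separated fields, got {len(fields)}")
    metrics_field, grouping, block_length, repos_field = fields
    # Metrics and repository URLs are comma-separated lists; drop empty items.
    metrics = [m.strip() for m in metrics_field.split(",") if m.strip()]
    repo_urls = [u.strip() for u in repos_field.split(",") if u.strip()]
    return ViewSettings(metrics, grouping, block_length, repo_urls)
```

For instance, parsing `[Most Edited File ; metrics-first ; fixed-length ; https://github.com/phusion/passenger-docker]` gives one metric, the grouping `metrics-first`, the block length mode `fixed-length`, and a single repository URL.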
The rest of this section was not part of the questionnaire itself, but rather a summary of the topics that were covered during the demonstration as part of this section of the questionnaire.
1. Big-picture description:
• What does the tool do: RepoGrams visualizes information about git repositories. It takes git repositories, lays all the commits out onto a single horizontal line as blocks, and colors each block with a color that corresponds to the value of some chosen metric. A metric can represent information from a variety of domains, such as features of the code (e.g., number of classes added/removed/modified) or of the development process (e.g., who made the commit). The tool helps the user investigate some aspects of git repositories or compare various features of the selected projects.
• Imagine that you developed some SE tool and you want to write a paper on that tool. Before you can do that, you have to run an evaluation on it, and before you can run an evaluation you need to choose projects to evaluate your tool on. We conducted a literature survey of SE papers and found that researchers rarely explain how they chose their evaluation targets or why they chose this subset and not another that would fit their criteria. We created the tool as a response to this problem; we hope to show that RepoGrams can be used to help choose projects with stronger confidence.
• Our main target audience is SE researchers, but the tool can also potentially be used by others, such as project leaders, managers, etc.
2. Basic metaphor: [in this part, as we go through the concepts we will demonstrate them using the tool]
• The basic metaphor of RepoGrams is as follows: Each line represents a single git project as it is represented in a single metric.
• Each block represents a single commit, and the commits are laid out in temporal order, regardless of parent commit, from left to right. So the first commit in the project is at the far left and the latest commit is on the far right.
     Across the different metrics, the same block represents the same commit, only as seen through that metric.
   • The block length can be changed to represent different things: it can represent how many lines of code have been changed in the commit (called churn in git), either comparable between the projects or incomparable (to see an overview of all projects), or it can just be a fixed width if the churn does not matter.
   • The color of each commit represents its value in this specific metric (see legend, description of metric).

3. Basic interactions available in the UI: [in this section we guide the participant with step-by-step instructions to play around with the tool]
   • Demonstrate adding 3 repositories
   • Demonstrate zoom and scroll
   • Demonstrate changing metrics, changing block length
   • Demonstrate swapping repository order, removing repository
   • Demonstrate loading of example data

4. Let the participant ask questions about the interface and suggest that they take a minute to try it themselves. Explain that next up we will give them 3 tasks to perform, and that during those three tasks they can ask us questions, but after those 3 tasks we will not be able to answer any question about the interface or the tasks (except for clarifications if the instructions are unclear).

Preparation tasks

This section includes 3 tasks. You can spend up to 5 minutes to finish these tasks (we will start the timer when you switch to the next page).

While working on these tasks you may ask clarification questions. Note that in the next section we will not be able to answer any questions regarding the interface.

1. Preparation task 1
   • [POM Files ; metrics-first ; fixed-length ;
     https://github.com/sqlitebrowser/sqlitebrowser,
     https://github.com/coolwanglu/vim.js,
     https://github.com/mattgallagher/AudioStreamer,
     https://github.com/LightTable/LightTable,
     https://github.com/jch/html-pipeline]
   • Based on the current view, what is your estimate of the number of commits in the AudioStreamer project?
     – 1–9
     – 10–99
     – 100–999
     – 1,000–9,999
     – 10,000–99,999
   • (Based on question 2 from the undergraduate study)

2. Preparation task 2
   • [Commit Localization ; metrics-first ; fixed-length ;
     https://github.com/sqlitebrowser/sqlitebrowser,
     https://github.com/coolwanglu/vim.js,
     https://github.com/LightTable/LightTable,
     https://github.com/jch/html-pipeline,
     https://github.com/GNOME/postr]
   • Based on the current view, which project had the longest uninterrupted sequence of highly localized (0.88–1.00) commits?
     – sqlitebrowser
     – vim.js
     – LightTable
     – html-pipeline
     – postr
   • (Based on question 9 from the undergraduate study)

3. Preparation task 3
   • [Number of Branches ; metrics-first ; lines-changed-incomparable ;
     https://github.com/sqlitebrowser/sqlitebrowser,
     https://github.com/coolwanglu/vim.js,
     https://github.com/LightTable/LightTable,
     https://github.com/jch/html-pipeline,
     https://github.com/GNOME/postr]
   • Based on the current view, which project appears to have a development process that is most similar to LightTable?
     – sqlitebrowser
     – vim.js
     – LightTable
     – html-pipeline
     – postr
   • (Based on question 11 from the undergraduate study)

Main tasks

Last chance for questions

Please take a bit of time to explore RepoGrams and ask clarifying questions. Beyond this point we will not be able to answer questions regarding the interface, the definitions of the metrics, and other elements of the tool.

4. Main task 1
   This task has a 3-minute time limit.
   • [Most Edited File ; metrics-first ; fixed-length ;
     https://github.com/phusion/passenger-docker]
   • Based on the current view, which of the following is true?
     – There is a general UPWARDS trend to the metric values
     – There is a general CONSTANT trend to the metric values
     – There is a general DOWNWARDS trend to the metric values

5. Main task 2
   This task has a 5-minute time limit.
   • [POM Files ; metrics-first ; fixed-length ;
     https://github.com/facebook/css-layout,
     https://github.com/qiujuer/Genius-Android,
     https://github.com/JakeWharton/butterknife,
     https://github.com/AndroidGears/Plugin,
     https://github.com/pedrovgs/TuentiTV,
     https://github.com/ksoichiro/Android-ObservableScrollView,
     https://github.com/square/picasso,
     https://github.com/google/iosched,
     https://github.com/square/retrofit]
   • Categorize the projects into two clusters: one cluster containing projects that use Maven (include .pom files), and the other cluster with projects that do not use Maven
   • Note: Our online questionnaire included a subquestion here. However, during our postmortem we found that the question was understood ambiguously by a large number of participants. As a consequence we removed that subquestion from our analysis and do not report on it here.

6. Main task 3
   This task has a 5-minute time limit.
   • [Branches Used ; metrics-first ; fixed-length ;
     https://github.com/munificent/wren,
     https://github.com/PHPMailer/PHPMailer,
     https://github.com/yahoo/pure,
     https://github.com/stympy/faker,
     https://github.com/mmozuras/pronto]
   • Categorize the projects into two clusters: one cluster containing projects that were developed on a single master branch before branching off to multiple branches, and the other cluster containing projects that branched off early in their development

7. Main task 4
   This task has a 5-minute time limit.
   • [Commit Author, Branches Used ; repos-first ; lines-changed-incomparable ;
     https://github.com/JedWatson/touchstonejs,
     https://github.com/pblittle/docker-logstash,
     https://github.com/lukasschwab/stackit,
     https://github.com/arialdomartini/oh-my-git]
   • Categorize the projects into two clusters: one cluster containing projects that have a correlation between branches and authors, and the other cluster with projects that do not exhibit this correlation

8. Main task 5
   This task has a 7-minute time limit.
   • [Commit Author ; repos-first ; lines-changed-comparable ;
     https://github.com/lukasschwab/stackit,
     https://github.com/deployphp/deployer,
     https://github.com/sequenceiq/docker-ambari]
   • (a) Categorize the projects into two clusters: one cluster has projects that have one single obvious dominant contributor, based on number of LINES OF CODE CHANGED, and the second cluster has projects that do not have such a contributor. A dominant contributor is one that generated close to or more than 50% of the line changes in the project.
     Hint: you might have to zoom in and scroll.
   • Before the next question, switch to the Fixed block length mode
   • (b) Categorize the projects into two clusters: one cluster has projects that have one single obvious dominant contributor, based on number of COMMITS, and the second cluster has projects that do not have such a contributor. A dominant contributor is one that generated close to or more than 50% of the commits in the project.
     Hint: you might have to scroll.

Open-ended questions

• Do you see RepoGrams being integrated into your research/evaluation process? If so, can you give an example of a research project that you could use/could have used RepoGrams in?
• What are one or two metrics that you wish RepoGrams included that you would find useful in your research?
  How much time would you be willing to invest in order to write code to integrate a new metric?
• In your opinion, what are the best and worst parts of RepoGrams?
• Choose one of the main tasks that we asked you to perform. How would you have performed it without RepoGrams?
• Do you have any other questions or comments?

C.1.4 Filtering results

We discarded the results from 1 participant who was disqualified for not having any prior experience with repository analysis (i.e., said participant did not check any boxes in Demographics question d).

C.2 Raw results

Here we list the raw results. The different background colors for cells denote equivalence sets. Each equivalence set denotes how many participants responded with the same answer for each question.

Table C.1: Raw results from the user study with SE researchers.

Demographics                                 Preparation questions                  Main questions
P   a  b             c                d      1        2              3              4          5          6      7     8a   8b
1   M  Daily         Faculty          Y,Y,Y  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LEELL  YNYY  NYY  YYY
2   M  Once a week   Masters student  Y,Y,Y  100–999  sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LELLL  YNNY  NYN  NYN
3   M  Once a week   PhD student      Y,Y,Y  10–99    sqlitebrowser  vim.js         constant   NNYNNNYYY  LEELL  YNNY  NYN  NYY
4   M  Once a week   Faculty          Y,Y,Y  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LEELL  YNYY  NYY  NYY
5   F  Once a month  Postdoc          Y,Y,N  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LEELL  YNNY  NYY  NYY
6   M  Daily         PhD student      Y,Y,Y  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LEEEL  YNYY  NYY  YYY
7   M  Once a week   Faculty          Y,Y,Y  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LEELL  YNNY  NYY  NYY
8   F  Once a week   PhD student      N,N,Y  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LEELL  YNYY  NYY  NYY
9   M  Once a week   PhD student      Y,Y,Y  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYNY  LEELL  NYYY  NYN  NYN
10  M  Daily         Masters student  Y,Y,Y  10–99    sqlitebrowser  html-pipeline  downwards  NNYNNNYYY  LEELL  NNYY  NYY  NYY
11  M  Once a week   Faculty          N,N,Y  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LEELL  YNNY  NYY  YYY
12  M  Once a week   PhD student      Y,N,Y  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LEELL  YNNY  NYY  NYY
13  M  Daily         Faculty          Y,Y,Y  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LEELL  YNYY  NYY  NYY
14  M  Daily         PhD student      Y,Y,Y  10–99    sqlitebrowser  html-pipeline  upwards    NNYNNNYYY  LEELL  YNYY  NYY  NYY

Percentage of participants who chose the most common answer:
1: 92.86%   2: 100.00%   3: 92.86%   4: 85.71%   5: 92.86%   6: 85.71%   7: 42.86%   8a: 78.57%   8b: 64.29%

We do not report raw text responses to preserve the anonymity of the participants.

Appendix D

Case study

This appendix contains meta-data and raw results for the case study to estimate the effort involved in adding new metrics, described in Section 5.3.

D.1 Results

D.1.1 Overview

We conducted a case study with two developers to estimate the effort involved in integrating new simple metrics into RepoGrams.

Dev1 is a computer science masters student who is the author of this thesis. Dev2 is a computer science fourth year undergraduate student.

Neither developer was familiar with the codebase before starting the experiment. Dev1 added 3 metrics between running the undergraduate user study and running the user study with SE researchers, to be used in the latter user study. Dev2 added 3 metrics that were requested by some of the participants from the user study with SE researchers. Dev1 shared only the following details with Dev2:

• Development environment setup instructions
• The names of the 3 metrics that Dev1 added, to be used as starting points for the exploration of the codebase

Both developers proceeded in a linear fashion, starting with setting up the development environment, then exploring the code, and finally developing, integrating, and testing each metric individually.
Both developers timed themselves as they performed each step.

D.1.2 Raw results

Development environment setup time:
• Dev1: 20 minutes
• Dev2: 39 minutes

Exploration of the codebase:
• Dev1: 10 minutes
• Dev2: 40 minutes

Metrics implementations¹:

Table D.1: Raw results from the case study to estimate the effort involved in the implementation of new metrics.

Dev   Metric name        Time (min)  LoC Python  LoC JavaScript  LoC JSON  LoC HTML & CSS
Dev1  POM Files          30          7           7               16        0
Dev1  Commit Author      52          12          26              730       33
Dev1  Commit Age         48          8           30              50        0
Dev2  Files Modified     42          7           7               16        0
Dev2  Merge Indicator    44          8           7               16        0
Dev2  Author Experience  26          17          7               16        0

¹ The number of LoC changed reported in Table D.1 might differ from those reported in Section 5.3. The codebase for RepoGrams was refactored between the case study and the writing of this thesis as a result of the case study to facilitate the addition and implementation of new metrics. As part of the refactoring process, many metrics were rewritten to take advantage of these changes.

Appendix E

License and availability

RepoGrams is free software released under the GNU/GPL License [26]. The source code for RepoGrams is available for download on GitHub [53]. A running instance of RepoGrams is available at http://repograms.net/.

