UBC Theses and Dissertations

Investigating completeness and consistency of links between issues and commits Thompson, Casey Albert 2017

Investigating completeness and consistency of links between issues and commits

by

Casey Albert Thompson

B.S. Computer Science, University of California Irvine, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

August 2017

© Casey Albert Thompson, 2017

Abstract

Software developers use commits to track source code changes made to a project, and to allow multiple developers to make changes simultaneously. To ensure that the commits can be traced to the issues that describe the work to be performed, developers typically add the identifier of the issue to the commit message to link commits to issues. However, developers are not infallible and not all desirable links are captured manually. To help find and improve links that have been manually specified, several techniques have been created. Although many software engineering tools, like defect predictors, depend on the links between commits and issues, there is currently no way to assess the quality of existing links. To provide a means of assessing the quality of links, I propose two quality attributes: completeness and consistency. Completeness measures whether all appropriate commits link to an issue, and consistency measures whether commits are linked to the most specific issue. I applied these quality attributes to assess a number of existing link techniques and found that existing techniques to link commits to issues lack both completeness and consistency in the links that they created. To enable researchers to better assess their techniques, I built a dataset that improves the link data for two open source projects.
In addition, I provide an analysis of information in issue repositories in the form of relationships between issues that might help improve existing link augmentation techniques.

Lay Summary

Software developers use commits to keep track of changes they make to a software project. To ensure that the commits can be traced to the issues that describe the work to be performed, developers manually create links between the commits and the issues. However, developers sometimes forget to manually create the links. To help find and improve links, several techniques have been created. Although many software engineering assistance tools depend on the links, there is currently no way to assess the quality of created links. To evaluate techniques that create links, I identify a way to assess the quality of the created links. To enable future researchers to better assess their techniques, I built a dataset that improves the link data. I finish with an analysis of issue-to-issue relationships that might help improve existing linking techniques.

Preface

All of the work presented in this thesis was conducted in the Software Practices Lab at the University of British Columbia, Point Grey campus.

The development of the dataset in Chapter 4 was done by A. Marques and myself. A. Marques donated days as the second researcher to double-check the dataset that was dependably developed.

Chapter 5 of the research presented in this thesis has been previously published in the article:

1. C. Thompson, G. Murphy, M. Palyart, M. Gašparič. How Software Developers Use Work Breakdown Relationships in Issue Repositories. In International Conference on Mining Software Repositories, MSR, pages 281-285, 2016.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Contributions
  1.2 Roadmap
2 Background and Related Work
  2.1 Background of Software Process Data
    2.1.1 Version Control Systems
    2.1.2 Issue Repositories
    2.1.3 Issue Relationships
  2.2 Existing Techniques to Link Commits to Issues
    2.2.1 Manual Linking by Developers
    2.2.2 Traditional Heuristics
    2.2.3 State-of-the-Art Techniques
  2.3 Research in Understanding Issue Repositories
    2.3.1 Research that Studies Issue Tracking Systems
    2.3.2 Approaches that Target Issue Tracking Systems
    2.3.3 Software Process Work
3 Assessing the Quality of State-of-the-Art Techniques
  3.1 Completeness and Consistency
    3.1.1 Completeness
    3.1.2 Consistency
  3.2 Performance of State-of-the-Art
    3.2.1 Identifying Projects
    3.2.2 Selecting Projects
    3.2.3 Technique Modifications
  3.3 Comparison of State-of-the-Art Techniques
    3.3.1 Completeness
    3.3.2 Consistency
    3.3.3 Summary of Quality Assessment
4 Creating a Complete and Consistent Dataset
  4.1 Existing Datasets
  4.2 Dataset Creation Process
    4.2.1 Projects Selected
  4.3 Link Identification
    4.3.1 Preprocessing Dataset
    4.3.2 Identification of Links
    4.3.3 Dataset Validity
  4.4 Dataset
    4.4.1 Summary
5 Semantic Relationships between Issues
  5.1 Qualitative Study
    5.1.1 Coding Process
    5.1.2 Codes
    5.1.3 Threats to Validity
    5.1.4 Results
  5.2 Summary
6 Discussion and Future Work
  6.1 Improving Links of Commits to Issues
  6.2 Techniques to Categorize Issue Relationships
    6.2.1 Algorithm
7 Conclusion
Bibliography
A Links examined for consistency

List of Tables

Table 2.1 Relationship instances
Table 3.1 Statistics of the projects selected
Table 3.2 The number of commits and issues in 2016, and % completeness of manually stated links by developers
Table 3.3 The % completeness for existing work and % improvement (Improv.) of linking relationship completeness of each state-of-the-art technique on selected projects.
Table 3.4 Relationship instances in each project that have an issue linked to a commit. Bold numbers are relationships that represent work breakdown in the issue repository
Table 3.5 Results of manually examining consistency of up to five links from each project.
Table 4.1 Commit and issue information for each project.
Table 4.2 Comparison of relationship types for the window selected compared to the issue repository as a whole for both Connect and Sonarqube
Table 4.3 Results of the linking process in identifying links from commits to issues
Table 5.1 Codes occurring in each repository
Table 6.1 Example of transformation using a restricted language.
Table A.1 Links examined for consistency evaluation in Chapter 3. No changes to the links were made by Phantom or Loaner.

List of Figures

Figure 1.1 The dotted line is a recovered link between issue and commit. The solid line is an existing relationship.
Figure 2.1 An example of a commit with the link CONN-1190 (highlighted in yellow) manually inserted by developers.
Figure 2.2 Gateway-3941, an issue from the Jira issue repository for the Connect open source project
Figure 3.1 Examples of complete and consistent linking relationships for two projects. Circles are commits in a version control system. Squares are issues in an issue repository. Dotted lines are relationships between issues. Solid lines are links between issues and commits
Figure 5.1 Codes developed through open coding
Figure 6.1 Example of categorizing with a supervised machine learning algorithm that learned to label issue pairs as check validity if the child issue contains the word test.

Acknowledgements

In the institutional inauguration, Susan E. Sim and Rosalva Gallardo Valencia were instrumental in instructing an immature yet inquisitive initiate in investigations. Invariably, I’m infinitely indebted.

Cherished and charming champion called Ashely V. Port constantly contributed caring criticism. Ashley is a companion who cheerfully contributed to Casey’s commendable caliber and constantly comforted when consolation commanded.

An appreciative Albert acknowledges an accommodating advisor, Gail C. Murphy. Always assiduous at all advisory aspects, and an apodictic advocate at assisting in analytical areas.

Time to thank the Thompsons; through the turbulent times their therapeutic talk took tolerance and thriftiness. They taught that training takes toughness and technique to terminate tasks.

Steadfast people! Language sometimes packs lashes. Smiling pleased, let me script pleasant letters for Software Practice Labs splendid pals, lifelong. So passionate, leaving a smart powerful lab. Some premium last words, Reid Holmes showered positive learning, and simply put laudable Elisa Baniassad shared practical lessons.

At all who affectionately assisted my achievements.
Alas advancement!

Chapter 1

Introduction

To assist developers in creating complex software, a number of software development tools have been created, including traceability [21], defect prediction [27] and time to completion [15]. These tools rely on underlying software process data, including issues in an issue repository, source code changes (i.e., commits) in a version control system, and links between commits and issues. A link from a commit to an issue indicates which artifacts (source code, documentation, and others) were changed in order to complete the work or defect described in an issue. Sometimes, the work required to complete an issue is split over multiple commits. A commit can link to multiple issues, and an issue can link to multiple commits.

The state-of-practice in linking consists of developers manually stating the id of the issue being addressed in the commit message of a commit. Developers also use tool support such as Mylyn (footnote 1) or Jira add-ons (footnote 2) to link commits to issues. The state-of-practice does not link all commits and issues: Bird and colleagues analyzed seven different projects and found that, on average, 63% of the fixed issues in the projects they examined are not linked to a commit [8]. State-of-the-art techniques in linking commits to issues try to improve on the state-of-practice by using the existing linking relationships to identify and link commits and issues that are not linked.

Footnote 1: www.eclipse.org/mylyn, verified 13 July 2017
Footnote 2: Git Integration for JIRA, Commit Policy for JIRA, Jenkins Plugin

Figure 1.1: (a) Existing technique recovering a link. (b) The correct link for the commit. The dotted line is a recovered link between issue and commit. The solid line is an existing relationship.

Links between commits and issues are used in a number of tools that do defect prediction [9, 23, 27, 33, 49]. If the association, or links, between the commit data and the issues are not perfect, bias can result.
A consequence of bias is that developers can be informed of defective code that is actually not defective [8]. As an example, in Figure 1.1a, a technique to automatically link commits to issues, called Phantom [45], creates a link between commit 14cd8d6b3 and the issue CAMEL-7354, but this issue is not the most relevant to the commit. Instead, as Figure 1.1b shows, the link should exist between commit 14cd8d6b3 and CAMEL-7675; an inspection of CAMEL-7354 would reveal that commit 14cd8d6b3 is more relevant to the sub-task CAMEL-7675, and the commit should link to that issue.

The goal of this thesis is to improve the quality of software process data in the form of links between commits and issues. In order to improve the quality, there needs to be a means to assess quality. However, there does not exist a standardized approach to such assessment in the community. To provide a means of assessment, I introduce the quality attributes of completeness and consistency. I define completeness as the extent to which all commits that should link to an issue in the issue repository are linked. I define consistency as the extent to which commits link to the most relevant issues. I show that existing techniques that aim to improve links between commits and issues do not fully address the attributes of completeness and consistency (see Chapter 3).

Given that existing data, drawn from open source projects, does not provide high levels of completeness or consistency, I create a dataset that improves the underlying software process data for two open source projects (see Chapter 4). This dataset is created manually by identifying all the issues that should be linked to a commit. The dataset provides a means for researchers to test techniques for improving links between commits and issues against a well-formed dataset.

Given that existing state-of-the-art techniques do not create fully complete and consistent links, there is room to improve these techniques.
One kind of data that has not yet been considered in these techniques is the structure that exists between issues in issue repositories in the form of relationships between issues. Developers use relationships between issues to hold information about how functionality in a system is related or how work is broken down. Using relationships in issue repositories can be challenging because different kinds of relationships are used for different projects, even when the same underlying issue repository technology is used. For instance, an Apache developer related when discussing issue relationships:

    it’s reasonable to use is related to. Or probably we can make it is required by, but I’m not so sure (footnote 3).

Footnote 3: From personal correspondence. Emphasis added.

Ambiguity among the meanings of relationship types can lead developers to incorrectly label relationships. To explore the kinds of relationships developers use, I performed a study of the issues in an issue repository to reveal underlying meanings of issue relationships by categorizing relationship types in three issue repositories. The study focused on work breakdown relationships, as those were the most prevalent among all three issue repositories. The study found there are indeed underlying categories of meaning for work breakdown relationships (see Chapter 5). In future work, researchers might build on these categories to help improve the completeness and consistency of the links between commits and issues.

1.1 Contributions

This thesis makes three contributions:

1. Existing techniques to link commits to issues are shown to be incomplete and inconsistent. Although a number of techniques have been proposed to improve links between commits and issues in a project, these techniques may still not produce as accurate a result as needed for software engineering techniques, like defect prediction, that build upon the information.

2. To help support the development and evaluation of new techniques to improve link data between commits and issues, I created a dataset consisting of a portion of two open source projects. The creation of the dataset followed a process to make the links in the dataset as complete and consistent as possible.

3. The kinds of relationships found in issue repositories are investigated, with an identification of the meaning of work breakdown relationships and a method to semantically categorize work breakdown relationships.

1.2 Roadmap

I begin by describing background about how developers use version control systems and issue repositories, followed by related work on techniques to link commits to issues and earlier efforts in characterizing issue repositories (Chapter 2). I then present a definition of two quality attributes, completeness and consistency, and assess existing techniques with respect to those attributes (Chapter 3). Next, I describe the process necessary to create a dataset, followed by the creation of the dataset (Chapter 4). I then describe a qualitative study to understand and categorize the relationships that exist between issues (Chapter 5). I also discuss how a better understanding of relationships can improve approaches to link commits to issues (Chapter 6) before concluding (Chapter 7).

Chapter 2

Background and Related Work

This chapter provides background on the software process data considered in this thesis and reviews related work about the structure and contents of issue repositories. The latter is relevant to the use of structure in issue repositories that is proposed as a means of improving software process data.

2.1 Background of Software Process Data

The software process data examined in this thesis comes from two sources, version control systems and issue repositories. A version control system contains changes made to a project by developers. An issue repository contains issues used to track the many parts being worked on in a project.
In addition to issue repositories, I examine the issue relationships and how they are used by developers.

2.1.1 Version Control Systems

Developers use version control systems to track source code changes made to a project and to allow multiple developers to make changes at the same time. Many different types of version control systems (VCS) exist, such as git (footnote 1), svn (footnote 2), and cvs (footnote 3). Despite being based on different principles, such as whether there is a master repository [53] or whether the VCS is distributed as in git, they all have the central goal of maintaining a history of all changes to source code. Each individual change is called a commit. In addition, each commit has metadata, such as the author of the change, a timestamp, and a textual message about the commit. In this thesis, I use GitHub (footnote 4) as the forge in which to find open source projects to study, as GitHub offers simple search features to identify open source projects that meet desired specifications and all projects hosted on GitHub use git.

Footnote 1: git-scm.com, verified 13 July 2017
Footnote 2: subversion.apache.org, verified 13 July 2017
Footnote 3: savannah.nongnu.org/projects/cvs, verified 13 July 2017

Figure 2.1 is an example of a commit in git for the open source project Connect (footnote 5). The top of Figure 2.1 (1) lists all of the commits in git for the project. The bottom half provides details for the specific commit that is highlighted in the list of all commits. The subject of the commit (2) contains a short description of the reason for the commit. Underneath the subject (3) is the person who is considered the author of the commit and the date the commit was made. The author can manually, or sometimes through tool support, add information about which issue the commit addresses. In this commit, the commit message identifies that the issue “CONN-1190” is addressed. On the right side (4) is the SHA that is the result of using a hash algorithm on the contents of the commit.
The SHA is a unique identifier used by the VCS to verify that the data is not corrupt and by developers to uniquely identify the commit. The full description of the commit includes the short description and other text that the developer wants to include; this description is called the commit message (5). The files that are changed in the commit are listed in (6): these are any files that have modifications, renaming, additions, or deletions. The bottom shows changes that occur in a file (7): the changes are represented by the lines added and/or deleted and are referred to as a diff.

2.1.2 Issue Repositories

Developers use issue repositories to keep track of work and communicate with others [6]. Issue repositories contain multiple types of issues; each type typically contains different information. This information can include what a system should do, who works on different parts of a system, what bugs are being fixed, and more [5, 24, 33].

Footnote 4: github.com, verified 13 July 2017
Footnote 5: connectopensource.com, Connect is a project supporting health information exchange, verified 13 July 2017

Figure 2.1: An example of a commit with the link CONN-1190 (highlighted in yellow) manually inserted by developers.

There are many different implementations of issue repositories, some focused on identifying bugs and some focused on tasks to build a system. An installation of an issue repository typically allows a project to set various configuration parameters to help fit the repository to the needs of the project. Despite differences between issue repositories, many commonalities exist. Figure 2.2 is an example of an issue (GATEWAY-3941, footnote 6) in the issue repository for the open source project Connect. I describe the relevant parts of an issue that are common amongst most issue repositories and that are used in this thesis to identify links between issues and commits. The top bar (1) shows the issue id with the title of the issue.
On the right hand side (2) are the people involved with the issue and important dates for the issue. Underneath the title (3) are details that a developer uses to understand the scope of the issue with respect to the system and to find the status of the issue. The description of the issue gives the assigned developer the details necessary to complete work on the issue (4). Issue relationships provide links to issues that have some association with the issue (5) and (6). The bottom contains communications with other developers, and other activities to help keep track of changes to the issue (7).

Footnote 6: connectopensource.atlassian.net/browse/GATEWAY-3941, verified 13 July 2017

Figure 2.2: Gateway-3941, an issue from the Jira issue repository for the Connect open source project

2.1.3 Issue Relationships

Developers use relationships between issues to hold information about how functionality in a system is related or how work is broken down. The ways in which developers use relationships in issue repositories can reveal the degree to which a project follows a particular process or to which a project tracks dependencies between system functionality, amongst others. Figure 2.2 shows the issue relationships, (5) and (6), for the issue (GATEWAY-3941). For example, in this issue, one of the relationships developers use is the ‘depends on’ relationship, which describes that the implementation of the issue (GATEWAY-3941) depends on the research of the issue (GATEWAY-3895).

Developers have a limited amount of time to perform many different activities, such as planning, developing, and testing. Even with limited time, developers choose to spend manual effort creating relationships between issues. Across three major open source projects, Mylyn, Connect, and HBase (footnote 7), a large percentage of issues have relationships, an average of 35% per repository, even though they use different issue repositories (footnote 8). Connect has the largest percentage at 75%, which means that many of the issues in this project have a relationship to another issue.
The large percentage of issue relationships means that developers use the relationships to capture information on which they rely.

Developers use a variety of different types of relationships in issue repositories. In Bugzilla (footnote 9), the default issue relationships that are used are depends-on and duplicates. In Jira (footnote 10), there are five default relationships, which are relates to, duplicates, blocks, clones, and sub-task. When Jira is used for a project following an agile project management style, it is also common to see such relationships as supported-by and part-of-epic.

Table 2.1 shows the many types and instances of relationships for three open source systems: Mylyn, Connect and HBase (footnote 11). There is common meaning among the different relationship types across different repositories. As an example of common meaning in relationships, Bugzilla’s depends-on has similar meaning to Jira’s sub-task.

Footnote 7: hbase.apache.org, verified 13 July 2017
Footnote 8: HBase uses Jira and has 4731/13334 issues with a relationship. Connect uses Jira and has 3496/5154 issues with a relationship. Mylyn uses Bugzilla and has 4956/13009 issues with a relationship.
Footnote 9: bugzilla.org, verified 13 July 2017
Footnote 10: atlassian.com/software/jira, verified 13 July 2017
Footnote 11: The data reported is from April 30, 2015

Table 2.1: Relationship instances

Relationship           Mylyn   Connect    HBase
Blocked                    -         -      299
Breaks                     -         -       47
Clone                      -         -       32
Contains                   -         -       12
Depends-on              2174       205      286
Duplicates              1520        43      159
Incorporates               -         -      208
Part-of-epic               -       559        -
Requires                   -         -      189
Relates-to                 -         -     1901
Subtask                    -      1643     1368
Supercedes                 -         -       31
Supported-by               -      1361        -
Total relationships     3694      3811     4532
Total issues           12959      5154    13334

2.2 Existing Techniques to Link Commits to Issues

In general, most linking consists of developers manually stating links as part of their development process. A number of automated tools have also been created to help automate this step and to improve linking data from existing repository information.
I describe how developers manually state issues in commits, and two approaches: traditional heuristics and state-of-the-art techniques.

2.2.1 Manual Linking by Developers

For many projects, developers follow company or project development practices by manually inserting issue ids in commit messages and commit SHAs into issues. These practices are either explicitly written in a contributor’s guide, such as found in the Connect project contributors guide (footnote 12), or are implicit practices found by looking at commit messages. The system of manually stating the issue id is reliant upon developers never forgetting to add the issue id and exactly matching the issue name and id. Bird and colleagues show that developers do forget to state issue ids in commit messages, resulting in incomplete linking data [4].

As an example, for the Connect project, a developer-stated issue id is expected to reference the issue in a fixed format in the commit message. An example of a manual link is in Figure 2.1 above, which shows that the author (developer) (2) wrote the commit message (5) “CONN-1190: Rebase conflict resolved for glassfish-web.xml.” The author manually included the issue id “CONN-1190” in the commit message. These issue ids follow a pattern that is unique to each issue repository; for example, in Connect the pattern is “CONN-” followed by a number.

Other tool support, such as Mylyn (footnote 13), exists to automatically create commit messages with the issue id. This tool is a plugin to the open source Eclipse IDE (footnote 14) and allows developers to automatically track source code that is relevant to the issue being addressed. The tool requires a developer to manually indicate what issue is being worked on. When a developer is finished working on an issue and makes a commit, the tool inserts the issue id and title of the issue as the commit message.
Use of such tooling increases the number of commit messages with an issue id, making later retrieval and identification easier.

Footnote 12: connectopensource.atlassian.net/wiki/display/CONNECTWIKI/How+to+Contribute+Code, verified 13 July 2017
Footnote 13: eclipse.org/mylyn, verified 13 July 2017
Footnote 14: eclipse.org, verified 13 July 2017

2.2.2 Traditional Heuristics

Early work attempts to coalesce an issue repository and a version control system by creating links between issues and commits. The work relies on developers manually stating the issue id in the commit message. I refer to this as traditional heuristics, as this is the first method used to link issues and commits. Traditional heuristics use patterns to link commits to issues, looking in the commit message for key words like “fix” or “bug” in combination with an issue id, or finding issue ids in a fixed format like “CONN-12”. These techniques minimize false positives by removing links from commits to non-existing issues in the issue repository, and by removing links where the commit is made more than 7 days before or after an issue is solved. Multiple papers use traditional heuristics to implement tools for tasks such as defect prediction, identifying changes that introduce bugs, or predicting when an issue will be reopened [18, 21, 23, 27, 33, 49, 57].

There exists tool support to automatically create links from issues to commits, such as add-ons for Jira. These add-ons scan commit messages in a version control system and extract issue ids that match the issue ids found in the issue repository. Any deviation from an exact issue match would cause the tool to miss a potential link.
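The pattern-based matching described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by any of the cited tools: the function name find_links, the tuple and dictionary layouts, and the sample dates are hypothetical. The sketch searches commit messages for ids in the fixed format “CONN-” followed by a number, drops ids that do not exist in the issue repository, and drops links where the commit is more than 7 days before or after the issue is resolved.

```python
import re
from datetime import datetime, timedelta

# Fixed id format for the Connect project: "CONN-" followed by a number.
ISSUE_ID = re.compile(r"\bCONN-\d+\b")
MAX_GAP = timedelta(days=7)  # the 7-day window named in the text


def find_links(commits, resolved_at):
    """Return (sha, issue_id) pairs found by pattern matching.

    commits: iterable of (sha, message, commit_time) tuples (hypothetical layout).
    resolved_at: dict mapping each existing issue id to its resolution time.
    """
    links = []
    for sha, message, commit_time in commits:
        for match in ISSUE_ID.finditer(message):
            issue_id = match.group(0)
            resolved = resolved_at.get(issue_id)
            if resolved is None:
                continue  # the id does not exist in the issue repository
            if abs(commit_time - resolved) > MAX_GAP:
                continue  # commit is too far from the issue's resolution
            links.append((sha, issue_id))
    return links


# Hypothetical shas and dates, for illustration only.
issues = {"CONN-1190": datetime(2014, 9, 10)}
commits = [
    ("a1", "CONN-1190: Rebase conflict resolved for glassfish-web.xml.", datetime(2014, 9, 9)),
    ("b2", "CON-20 Updated PDP", datetime(2014, 9, 9)),
    ("c3", "CONNECT-1368: More updates.", datetime(2014, 9, 9)),
    ("d4", "CONN-1190: late follow-up", datetime(2014, 10, 30)),
]
print(find_links(commits, issues))
```

Only the first commit is linked: the second and third messages deviate from the exact id format, and the fourth falls outside the 7-day window.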
Examples of Connect commit messages that would not match are “CON-20 Updated PDP”, and “CONNECT-1368: More updates.” Although these commit messages reference an issue, the tool would not create a link because the issue ids do not match any issue in the issue repository.

2.2.3 State-of-the-Art Techniques

Recent research in linking commits to issues primarily uses information retrieval (IR) techniques that are not dependent on the developer-stated issue id in commit messages [10, 32, 35, 45, 56]. State-of-the-art techniques use three approaches to link commits to issues: the first is to use the similarity of the text in the commit and issues, the second is to use only heuristics based on authorship and time, and the last is to use machine learning to create the links.

Wu and colleagues were the first to use an IR-based linking technique (Relink), using the similarity of the text in a commit and issue rather than identifying issue ids in commit messages [56]. Relink uses three criteria for linking a commit to an issue: the similarity of the commit message to the text in the issue title, description, and comments; the time between an issue comment and a commit; and the requirement that the author of the commit is assigned to the issue. Relink uses the links identified with heuristics to set the minimum thresholds, for the similarity of text between a commit and issue and for the time between an issue comment and a commit, that are necessary to link commits to issues.

Nguyen and colleagues build on Relink to create a technique called Mlink; the technique uses code elements to identify links between issues and commits [35]. Mlink identifies code elements in issues and compares them to the code changes in a commit. In addition to using Relink’s techniques, Mlink uses both the commit message and comments added in the code and compares them to the text in the issue title, description and comments. Mlink differs from Relink as it does not use authorship to determine if a commit links to an issue.
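The three Relink criteria described above (text similarity, time between an issue comment and a commit, and authorship) can be sketched as follows. This is an illustrative sketch, not the Relink implementation: the bag-of-words cosine similarity is a stand-in for whatever similarity measure the technique uses, the dictionary layout and the names cosine_similarity and should_link are hypothetical, and the threshold values are placeholders, whereas Relink derives its thresholds from links already identified with heuristics.

```python
import math
import re
from collections import Counter
from datetime import datetime, timedelta


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def should_link(commit, issue, min_similarity=0.3, max_gap=timedelta(days=2)):
    """Check all three criteria; both thresholds are placeholder assumptions."""
    issue_text = " ".join([issue["title"], issue["description"], *issue["comments"]])
    similar = cosine_similarity(commit["message"], issue_text) >= min_similarity
    close_in_time = any(abs(commit["time"] - t) <= max_gap for t in issue["comment_times"])
    same_person = commit["author"] == issue["assignee"]
    return similar and close_in_time and same_person


# Hypothetical commit and issue, for illustration only.
commit = {
    "message": "Fix null pointer in audit logger",
    "time": datetime(2015, 3, 2, 10, 0),
    "author": "alice",
}
issue = {
    "title": "Null pointer in audit logger",
    "description": "The audit logger throws a null pointer exception",
    "comments": ["Fixed in latest commit"],
    "comment_times": [datetime(2015, 3, 2, 12, 0)],
    "assignee": "alice",
}
print(should_link(commit, issue))
```

Because the criteria are combined with a logical and, failing any single criterion (for example, a commit authored by someone not assigned to the issue) prevents the link.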
Mlink optimizes thresholds using the same process as Relink.

Bissyande and colleagues' technique differs from both Relink and Mlink by using only similarity of the text in commits and issues to link commits to issues [10]. They use three techniques to identify text in issues and commits: vector space modeling (VSM), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA). In the end, Bissyande and colleagues find that using only text similarity performs worse than both Relink and Mlink.

Loaner and Phantom [45] use heuristics based on time and authorship to determine if a link should be made. Unlike Relink, these techniques do not use text similarity. The Phantom technique identifies issues that have existing linked commits and, based on the files changed, finds similar commits to link to the issue. The Loaner technique links commits to issues that have no existing link, and differs from Relink as it does not require text similarity between issues and commits. Loaner and Phantom optimize thresholds using the same process as Relink.

RCLink [32] uses machine learning to link commits to issues. Le and colleagues, the authors of RCLink, create features from information found in issues and commits for their machine learning technique. They report that RCLink performs better than Relink [56], Bissyande and colleagues' technique [10], and Mlink [35].

2.3 Research in Understanding Issue Repositories

In this thesis, I propose the idea of using structure in issue repositories, as found in relationships between issues, to improve software process data. In this section, I provide an overview of related research in issue repositories.

2.3.1 Research that Studies Issue Tracking Systems

A number of researchers have investigated and described characteristics of issues and issue repositories.
Mockus and colleagues characterize aspects of problem reports (i.e., issues), such as time to resolve a problem, as part of a characterization of open source development in the Mozilla¹⁵ and Apache¹⁶ projects [33]. Anvik and colleagues characterize the Eclipse¹⁷ and Firefox¹⁸ repositories, describing the bug triage and duplicate bug problems that occur in open source repositories [1]. Ko and colleagues analyzed titles of individual issues to determine such aspects as the degree to which issues refer to particular parts of a system and how much regularity there is between issue titles [29]. Bettenburg and colleagues consider what additional information should be included in an issue to assist a developer [6]. Banerjee and colleagues examined Mozilla and Eclipse repositories, finding that the maturity of a reporter reduces how often insignificant, poor quality, and duplicate bugs are detected [5]. Jankovic and colleagues find that issues and commits can be used to reconstruct software processes; they define issues as parallel or sequential based on the existence or nonexistence of a “block” relationship link between issues [24].

2.3.2 Approaches that Target Issue Tracking Systems

Other research efforts use the information in an issue repository to improve software development. Sandusky and colleagues examined and grouped reports in an issue repository to see if groupings could improve the management of problems [44]. Cubranic and Murphy [17] and Anvik and colleagues [2] applied machine learning to categorize issues automatically and to automate such bug triage activities as assigning the issue to an individual for attention. Runeson and colleagues [43] applied natural language processing methods to automatically recommend duplicate issues. Wang and colleagues use a combination of natural language processing and execution information to improve recall in detecting duplicate bug reports [55]. Rocha and colleagues recommend the next bug to work on, using cosine similarity to find similar bugs [42].
Two works of which I am aware make use of relationships between issues. The first, by Rastkar and Murphy, describes how extractive summarization can be applied to produce summaries of linked issues automatically [39]. The second, by Choetkiertikul and colleagues, uses explicit and implicit relationships between issues to predict delays in a software project and shows that relationship information can help to predict such delays more accurately [15]. By characterizing relationships between issues, I aim to identify more opportunities for using relationship information to improve software development practices.

¹⁵ mozilla.org, verified 13 July 2017
¹⁶ apache.org, verified 13 July 2017
¹⁷ eclipse.org, verified 13 July 2017
¹⁸ mozilla.org, verified 13 July 2017

2.3.3 Software Process Work

Initial research in software process attempted to prescribe to developers a software process model, which is a formal representation of the process to be used in development [11]. Others have considered how to interpret software data from VCS or issue repository information. Cook and Wolf introduced automatic identification of formal software process models through event logs generated from data collected from version control systems and issue repositories. They describe three techniques that use event log data to construct a UML-like process model, based on statistics, Markov chains, or neural networks [16]. Kindler and colleagues use the events in a version control system, in combination with a developer annotating the purpose of each commit, to generate event logs and, from those logs, UML-like process models [28]. Hindle and colleagues use version control systems, issue repositories, and mailing lists to generate activity timelines showing software process [22]. Duan and colleagues create dynamic, project-specific, UML-like software process models by making assumptions about the purpose of files in a version control system to automatically generate event logs [19].
Research specific to event log creation has focused on how data can be extracted from multiple repositories in order to enable reuse of process model generation. Poncin and colleagues introduce Fraser, a tool to automate event log creation from any number of version control systems and issue repositories. They show that separating event log extraction and analysis into distinct phases allows reuse of analysis tools [36]. These approaches look at high-level aspects of software process in version control systems and issue repositories to create process models.

Chapter 3

Assessing the Quality of State-of-the-Art Techniques

A growing number of software engineering techniques rely on software process data. As described in the introduction, one particular form of data that is relied upon is the association (or linking relationship) of commits to issues. Despite the widespread use of this data, there does not exist a standardized means to assess its quality. In this chapter, I introduce two quality attributes to assess the linking relationship: completeness and consistency. A linking relationship is complete when all commits link to an issue, and is consistent when commits are linked to the most specific issue.

If software process data lacks completeness or consistency, then the techniques that rely on that data may produce misleading results. From an evaluation perspective, techniques evaluated on data that lacks completeness or consistency may be evaluated with bias, i.e., the recall (due to completeness) and the precision (due to consistency) reported for these techniques may be incorrect. To assess the degree to which existing techniques aimed at improving the linking relationship achieve completeness and consistency, I assessed four state-of-the-art techniques. My method of assessment involved selecting 10 open source projects and running the four state-of-the-art techniques on the 10 projects to identify links for commits not linked to an issue.
Using the results of the state-of-the-art techniques, I show that no technique creates a fully complete and consistent linking relationship. In addition, the existing linking relationship in each project is neither complete nor consistent, although a complete and consistent linking relationship is needed in order to compare techniques and to create a new technique to link commits to issues.

3.1 Completeness and Consistency

Software techniques, such as defect prediction, traceability, and prediction of time to completion of issues, rely on the link relationship between issues and commits [15, 21, 27]. The linking relationship is defined by the set L, where L is a set of pairs (links) (i, c), i is an element (an issue) in the issue repository I, and c is an element (a commit) in the VCS C (see Equation 3.1).

$$L \subseteq \{(i, c) \in I \times C\} \tag{3.1}$$

Each link expresses which code and other artifacts were changed as part of an issue (i.e., a link from an issue to a commit or commits). If many links in the set L are missing or inaccurate, then software techniques can produce incorrect results, as discussed later in this section. L is a many-to-many relationship: a commit can link to zero or more issues, and an issue can link to zero or more commits. To help assess the quality of L for a given project, I introduce the concepts of completeness and consistency.

3.1.1 Completeness

I consider L to be complete when all commits are linked to one or more issues. Formally, L is complete when Complete(L) is true (see Equation 3.2).

$$\mathrm{Complete}(L) \iff \forall c \in C.\ \exists i \in I.\ (i, c) \in L \tag{3.2}$$

That is, for every commit c in the VCS C there exists a link (i, c) in L. Figure 3.1a shows an example of a completely linked project. The project has all commits linked to one or more issues in an issue repository. The completeness of a project is assessed quantitatively using the percentage of commits linked to issues.

A stronger definition of completeness would place a constraint that all issues appear in L.
However, in most issue repositories, there exist issues that describe work or intended aspects of a project that do not result in changes to artifacts; as such, this constraint could never be fully met. Issues that do not address source code changes can be created, for example, to provide potential directions that a project might take, or to provide customer support for a part of a project [1]. Due to the many potential reasons to create an issue, linking all issues to a commit is not likely to occur, and is not investigated in this thesis.

Figure 3.1: Examples of complete and consistent linking relationships for two projects: (a) completely linked project, (b) consistently linked project. Circles are commits in a version control system. Squares are issues in an issue repository. Dotted lines are relationships between issues. Solid lines are links between issues and commits.

As an example of the importance of completeness of L, consider defect prediction, which relies on the identification of files that often have bugs. The message accompanying a commit, known as a commit message, and the documentation in changed artifacts may not provide enough detail to automatically determine the reason behind the commit (e.g., bug fix or feature implementation). Defect prediction uses L to identify why a change happened and to improve the prediction algorithm. An incomplete L, where not all commits are linked to an issue, can cause the defect predictor to give less accurate results [8].

3.1.2 Consistency

Multiple issues might be used to describe the source code changes in a commit. These issues may have relationships between themselves, which are manually created by developers as explained in Chapter 2.1.3. L is consistent when commits link to the most specific issues, for those issues involved in a relationship. By specific, I mean that the issue's text describes the reasoning for the source code changes in those commits.
An issue relationship is defined by R, which is a set of pairs (i_i, i_j) where both i_i and i_j are in the set of issues I, and there is a relationship between i_i and i_j (see Equation 3.3).

$$R \subseteq \{(i_i, i_j) \in I \times I\} \tag{3.3}$$

A subset of links, L′, involves issues that have relationships to other issues and is used to determine the consistency of L. L′ contains the pairs (i′, c) in L where i′ is on either side of a relationship in the set R (see Equation 3.4).

$$L' = \{(i', c) \in L \mid (i', i) \in R \text{ or } (i, i') \in R \text{ for some } i \in I\} \tag{3.4}$$

I focus on the set L′ because the issues in these links are part of a relationship, and developers may, as part of their development process, manually link a commit to a less specific issue in that relationship. As an example of linking to a less specific issue, consider Figure 3.1b, where issues B and C have a relationship to issue A, and the intent of the source code changes in commit 1 is described in issue A. If a developer links commit 1 to issue C, the link is to a less specific issue and should be changed to issue A, which is the most specific issue.

To determine if a commit is linked to the most specific issue in a relationship, I define a function f that takes a link and the set R and returns true if the commit c is linked to the most specific issue i′ (see Equation 3.5). L is considered to be consistent when Consistent(L) is true. This means L is consistent if the function f determines that all the commits in L′ are linked to the most specific issues (see Equation 3.6).

$$f((i', c), R) = \begin{cases} \text{true} & \text{if } c \text{ is linked to the most specific } i' \\ \text{false} & \text{otherwise} \end{cases} \tag{3.5}$$

$$\mathrm{Consistent}(L) \iff \forall (i', c) \in L'.\ f((i', c), R) \tag{3.6}$$

Figure 3.1b shows an example of a consistently linked project. Each commit is linked to the most specific issue in the project.
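As a rough illustration of the definitions in this section, the following sketch computes the percentage used to assess completeness, builds the subset L′, and checks consistency. All function and parameter names are hypothetical. The `most_specific` argument stands in for the judgment f, which the thesis performs manually, so here it is simply a caller-supplied predicate.

```python
def completeness(commits, links):
    """Percentage of commits that appear in at least one link (issue, commit)."""
    all_commits = set(commits)
    if not all_commits:
        return 100.0  # with no commits, completeness holds vacuously
    linked = {commit for (_, commit) in links}
    return 100.0 * len(linked & all_commits) / len(all_commits)


def links_with_relationships(links, relationships):
    """L': links whose issue is on either side of some pair in R."""
    related = {issue for pair in relationships for issue in pair}
    return {(issue, commit) for (issue, commit) in links if issue in related}


def is_consistent(links, relationships, most_specific):
    """Consistent(L): the predicate standing in for f holds for every link in L'."""
    return all(
        most_specific(link, relationships)
        for link in links_with_relationships(links, relationships)
    )
```

For example, with commits 1 to 4 and links {(A, 1), (B, 2), (D, 3)}, the completeness is 75%, and with R = {(A, B), (A, C)} only the links to A and B belong to L′.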
Function f is evaluated manually by checking each link to identify whether the commit is linked to the most specific issue; if function f returns false for any link, L is considered to be inconsistent.

Linking commits to the most specific issue is important for software tools that predict delay in software projects [15]. These tools extract features from links, and use those features to train machine learning techniques to predict delay. Choetkiertikul and colleagues predict delays using, among other features, the time difference between when an issue is opened and closed and the files changed. If a commit is linked to a less specific issue, the time between open and closed may be different and cause the estimation to be inaccurate. Having commits linked to the most specific issue may improve time to completion estimations.

3.2 Performance of State-of-the-Art

To demonstrate some of the problems with software process data, I evaluated the completeness and consistency of four state-of-the-art techniques across 10 open source projects that have an issue repository with issue relationships. The four techniques examined represent the three approaches used to do linking: Relink [56], which uses similarity of the text in commits and issues; Loaner and Phantom [45], which use heuristics based on authorship and time; and RCLink [32], which uses machine learning. Completeness for each technique is evaluated quantitatively by reporting the percentage of commits linked to an issue. Consistency is evaluated qualitatively, by manually examining the results of each technique to determine, based on the semantics of each commit, whether the commit is linked to the most specific issue with a relationship.

3.2.1 Identifying Projects

I focused on open source projects because the VCSs for these projects are easily accessible. I used GitHub¹ as a way to identify open source projects.
GitHub is a widely used collaborative online platform for developers and companies to store and share a project's version control system. As all of the state-of-the-art techniques were built for Java projects, I selected Java projects to be evaluated. On GitHub, projects can be ordered by the number of stars that users of GitHub have given the project. I follow the lead of other researchers [13, 59] in using star ordering to avoid projects of low quality [26, 40].

To assess consistency, I required projects that capture work within issues and use relationships between issues. I chose projects that use a Jira issue repository because Jira issue repositories by default allow developers to make relationships between issues. To identify if a project uses Jira, I searched the “readme” file of the GitHub project for the word “Jira.”

3.2.2 Selecting Projects

To be suitable as a target for evaluation, I required projects to have links manually stated by developers, and the projects' issue repositories to use issue relationships. Manually stated links are needed because all of the state-of-the-art linking techniques being evaluated need projects that have existing links. To assess consistency, I need projects that have relationships between issues.

To ensure projects used relationships, 20 issues were randomly sampled from the Jira issue repository of each project. A project was selected if 25% (five of the 20 sampled issues) or more participated in an issue relationship. I selected 25% as a threshold because work by Jankovic and colleagues showed that an average of 24% of issues had an issue relationship across 3 open source projects that used Jira [24]. I did further data collection, downloading all Apache projects that used Jira, and found that 24% of over 4 million issues had an issue relationship. The projects that have existing links and in which over 25% of the issues have a relationship are expected to be typical open source projects.

I had to consider the first 17 projects to find 10 suitable projects.
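The relationship check used for project selection can be sketched as follows. This is only an illustration under stated assumptions: `issues` is a list of issue ids, `has_relationship` is a hypothetical predicate that reports whether an issue participates in any issue relationship, and the constants mirror the sample size of 20 and the 25% threshold described above.

```python
import random

SAMPLE_SIZE = 20   # issues sampled per project
THRESHOLD = 0.25   # fraction of sampled issues that must have a relationship


def uses_relationships(issues, has_relationship, rng=random):
    """Sample issues and check whether enough of them have a relationship."""
    sample = rng.sample(issues, min(SAMPLE_SIZE, len(issues)))
    related = sum(1 for issue in sample if has_relationship(issue))
    return related >= THRESHOLD * len(sample)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling repeatable, which may be useful when the selection needs to be reproduced.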
The selected projects are shown in Table 3.1 with the number of commits, issues, and issues with a relationship.

¹ github.com, verified 13 July 2017

Table 3.1: Statistics of the projects selected

Project             Commits   Issues   Issues with Relationship
cordova-android        2903    11957                       4232
Flink                  9887     5935                       1402
H2o 3                 16102     3954                       1005
Maven                 10205     4954                       2044
Mongo Java             5561     2240                        725
Mongo Hadoop           1085      281                         58
Nutch                  2212     2351                        654
Pentaho               16049    14282                       5503
Sonarqube             20905     7658                       3623
Spring Framework      13603    15118                       4723

3.2.3 Technique Modifications

For all the state-of-the-art techniques assessed, I needed to identify the links stated manually by developers. I wrote a script that finds issue ids in commit messages using pattern recognition². I selected commits made in 2016 and issues that were created, commented on, or resolved in 2016 for each project. Table 3.2 shows the total number of commits and issues identified for 2016 and the percentage of commits with links identified using the pattern recognition.

Each state-of-the-art technique needs specific changes in order to map the project data to the input each technique expects. I describe the changes made to each technique in turn.

Relink

To use Relink, I needed to adjust the issues and commits for each project. Relink expects issues and commits to be numbered, so I mapped issue ids to the numbers 1 to n, and commit SHAs to the numbers 1 to n.
Relink also expects the mapped issue number to be displayed in the commit message if there is a link, so I created a mapping to identify issue ids in the commit message and map each to the correct number from the previous step. For example, when CONN-1190 is found in a commit message, it is replaced by 120.

² I used the following pattern, (?i)<projectname>[^\w]*\d+, for each project; <projectname> is replaced with the name used in the issue id.

Table 3.2: The number of commits and issues in 2016, and % completeness of manually stated links by developers

Project             Total Commits   Total Issues   Existing Completeness %
Cb                            178           3121                        47
Flink                        2015           2710                        77
H2o 3                        3705           1950                        21
Maven                          50            521                        63
Mongo Hadoop                   53            278                        43
Mongo Java                    245           2205                        66
Nutch                         112            341                        60
Pentaho                       928           9944                        29
Sonarqube                    3214           7658                        65
Spring Framework             2163           1938                        49

Phantom and Loaner

The issues for both Loaner and Phantom needed to be adjusted for each project. Both techniques expect to link commits to issues that are in a “resolved” state. The projects I evaluated have development processes that differ from those of the projects on which Phantom and Loaner were evaluated. I mapped the “closed” issue status to “resolved” to match the expected input for these techniques. Additionally, both techniques expect the issues to be classified into one of three categories based on issue type: bug, feature, or other. Using the definition of each category in their paper [45], I created the same classifications for issue types not mentioned³.

³ The issue types fault, pruning, and refactoring were categorized as bug type. Dependency upgrade, engineering story, new feature, request, spike, story, sub-task, technical task, and test were all categorized as feature type. The rest were categorized as other type.

RCLink

RCLink expects a training set that has 10 times more unlinked commit-issue pairs than manually linked commits. To achieve this ratio, I randomly selected from the commits with a link, and issues not linked to that commit.
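One way to read the sampling step just described is sketched below. It is an assumption-laden illustration rather than RCLink's own code or the script used in this thesis: the function name `build_training_pairs`, the `ratio` and `seed` parameters, and the representation of links as (issue, commit) pairs are all hypothetical. For each manually linked pair, the sketch draws up to `ratio` issues that are not linked to the same commit and records them as unlinked pairs.

```python
import random


def build_training_pairs(links, issues, ratio=10, seed=0):
    """Return (linked, unlinked) pairs with up to `ratio` unlinked per linked pair.

    links:  set of (issue, commit) pairs stated manually by developers
    issues: iterable of all issue ids in the repository
    """
    rng = random.Random(seed)  # seeded so the sample is repeatable
    linked_by_commit = {}
    for issue, commit in links:
        linked_by_commit.setdefault(commit, set()).add(issue)

    chosen = set()
    for _, commit in sorted(links):
        # Candidates: issues not linked to this commit and not already drawn for it.
        candidates = sorted(
            i for i in set(issues)
            if i not in linked_by_commit[commit] and (i, commit) not in chosen
        )
        picks = rng.sample(candidates, min(ratio, len(candidates)))
        chosen.update((i, commit) for i in picks)
    return sorted(links), sorted(chosen)
```

If a repository has fewer than `ratio` unlinked issues for a commit, the sketch returns as many as exist, so the 10 to 1 ratio is only reached when enough candidate issues are available.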
The last change I made was to refactor their algorithms in order to speed up the technique. Their algorithms had redundancy in the calculations, so I created a map to identify whether a calculation had already been made in order to avoid repetition. The refactorings were validated on their test data.

3.3 Comparison of State-of-the-Art Techniques

This section discusses how state-of-the-art techniques perform with regard to completeness and consistency in the projects identified in the previous section. Completeness is evaluated by looking at the percentage of commits linked to an issue. Consistency is measured manually by sampling five links from each project and determining if the commit is linked to the most specific issue. Five links are chosen because the examination is a time-consuming manual process.

3.3.1 Completeness

Table 3.3 reports the percentage completeness of the existing linking relationship and the percentage of improvement after applying each state-of-the-art technique. The results show that RCLink has a higher percentage of completeness than the other three algorithms for most of the 10 projects. In some cases, for example H2o 3 and Pentaho, it more than doubles the number of commits linked to an issue.

In some situations, such as with Cb, Mongo Java, and Nutch, RCLink does not achieve the most completeness, but it is close to the better performing technique, with only one or two percentage points of difference. RCLink's performance may be due to the technique being able to identify links in more situations. Phantom performed the worst of the four algorithms, increasing completeness by only 1% on 2 projects. The poor performance may be because the technique works only in instances where projects have multiple commits that should link to a single issue.

Table 3.3: The % completeness for existing work and % improvement (Improv.) of linking relationship completeness of each state-of-the-art technique on selected projects.

Project             Existing       Relink    Loaner    Phantom   RCLink
                    Completeness   Improv.   Improv.   Improv.   Improv.
Cb                            47        22         2         0       17
Flink                         77         2         0         0        6
H2o 3                         21         5        13         1      142
Maven                         63         0         0         0        0
Mongo Hadoop                  43         4         0         0       13
Mongo Java                    66         2         3         1        2
Nutch                         60        10         3         0        8
Pentaho                       29         1         1         0      123
Sonarqube                     65         0         1         0       14
Spring Framework              49         2         5         0       26

3.3.2 Consistency

To evaluate consistency, I manually examine links to determine if the commit is linked to the most specific issue. I then report on whether a project has a consistent or inconsistent linking relationship, and whether state-of-the-art techniques make changes to any of the links examined. I refer to the issue that is the source of a relationship as the parent and the issue that is the target of a relationship as the child.

The 10 projects examined define a number of relationship types. Table 3.4 shows all relationship types and the number of instances of each relationship type that contain an issue with a link to a commit. I focus on relationships that break work down, where the parent issue is split into child issues, which allows various parts of an issue to be worked on separately. This breakdown of work may cause developers to link commits to issues inconsistently (e.g., linking a commit to a parent rather than a child). I examined the Jira documentation and identified that the “sub-task” relationship is used to break an issue into child issues (the documentation refers to these as smaller pieces of a larger task⁴).

⁴ confluence.atlassian.com/adminjiracloud/issue-types-844500742.html, verified 13 July 2017

As can be seen in Table 3.4, not all projects use the “sub-task” relationship. I manually examined all relationship types for the seven projects with fewer than five instances of a “sub-task” relationship; I discovered that “depends on” is also used in these projects to break work down. Although “relates to” is used frequently across all projects, it is used to describe both work breakdown and features that address the same part of a project; due to this inconsistent use, I do not use it in this evaluation. No work breakdown relationships were identified in the Mongo Hadoop project after looking at all of its relationships.

Table 3.4: Relationships instances in each project that have an issue linked to a commit. Bold numbers are relationships that represent work breakdown in the issue repository

Relationship          Cb  Flink  H2o 3  Maven   Mongo  Mongo  Nutch  Pentaho  Sonarqube     Spring
                                               Hadoop   Java                             Framework
blocks                 0     30     11      2       0      0      1        3          0          0
breaks                 0      0      0      1       0      0      2        0          0          0
causes                 0      0      2      0       0      0      0        0          0          0
clones                 7      1      2      0       0      0      0        3          0          0
contains               0      4      0      0       0      0      0        0          0          0
contributes to         0      0      0      0       0      0      0        0          4          0
depends on             1     27      0      6       0      1      4        0        112         38
deprecates             0      0      0      0       0      0      0        0          2          0
duplicates            10     46      3      2       1     13      3       25         35         39
implements             0      0      0      0       0      0      0        0         10          0
incorporates           0     11      0      4       0      0      0        0          0          0
part of Epic           0      0     54      0       3     21      0        0          0          0
relates to             4     70     37     20       3     13     10      130        107        662
replaces               0      0      0      0       0      0      0        0          2          0
requires               0     22      0      2       0      0      0        0          0          0
Sub-task               3    197     30      1       0      0      2        2         84          4
supersedes             0     21      0      4       0      0      5        0          3         13
total relationships   25    429    139     42       7     48     27      163        359        756

To select issues for the manual examination, I randomly sample up to five links from each project. The issues in the links must meet the following criteria:

1. part of a work breakdown relationship
2. the parent in the relationship
3. marked as resolved or closed

The first criterion selects only issues that are in a work breakdown relationship (i.e., those issues that have a “sub-task” or “depends on” relationship). The second criterion is used to select the issues whose work is to be broken down (i.e., I select the parent in the “sub-task” or “depends on” relationship), as developers may link to the parent issues rather than the broken down parts.
The last criterion selects issues that are resolved or closed, to remain consistent with the method each state-of-the-art technique uses to link commits to issues. A total of 29 links were selected for the 10 projects; the number of links for each project is shown in Table 3.5. None of the links in the Mongo Java project were to issues that needed work to be broken down.

The next step is to manually look at the commits selected and determine if each commit is linked to the most specific issue. To manually examine a link, I look at the commit, the issue, and all related issues (i.e., those issues that have a relationship to the issue in the link). To identify whether a commit should be linked to a different issue, I compare the issue title, description, and comments with the commit message, files changed, and changes to code, to determine if the commit addresses the issue. This process is repeated for all related issues, and if another issue should be linked, I mark the link as inconsistent.

The results of manually examining the 29 links across the projects are reported in Table 3.5. All links examined can be found in Appendix A.

Table 3.5: Results of manually examining consistency of up to five links from each project.

Project             # links examined   Project consistency
Cb                                 1   Consistent
Flink                              5   Inconsistent
H2o 3                              5   Inconsistent
Maven                              3   Consistent
Mongo Hadoop                       0   N/A
Mongo Java                         0   N/A
Nutch                              3   Consistent
Pentaho                            2   Inconsistent
Sonarqube                          5   Inconsistent
Spring Framework                   5   Inconsistent

I found inconsistency in four projects: Flink, H2o 3, Pentaho, and Sonarqube. An example of inconsistency is the link from the H2o 3 project where the commit should link to the most specific task (PUBDEV-2746) rather than the less specific task (PUBDEV-1843):

Parent PUBDEV-1843: Grid testing
Child PUBDEV-2746: Make sure GridSearch will print a warning for bad hyperparameter names and values
Commit 162c909: PUBDEV-1843: grid test, subtask 7. Java error/warning messages are ignored when they are generated for unit tests for gridsearch. Added code to print out these error/warning message during unit tests so we can better debug our code.

In this example, the commit 162c909 is linked by the developer to the issue PUBDEV-1843 even though the child issue PUBDEV-2746 contains a description discussing printing an error message. The commit message mentions “subtask 7” and “printing error messages.” In addition, the source code of the commit makes changes that relate to printing error messages in grid search. The correct link should be to PUBDEV-2746 instead of PUBDEV-1843.

I then look at the links created by each of the state-of-the-art techniques and report whether the commits are linked to the more specific task identified in the previous step. I found that Relink and RCLink both created a link to the more specific issue identified for a link in the H2o 3 project:

Parent PUBDEV-3482: Supporting GLM binomial model to allow two arbitrary integer values
Child PUBDEV-3791: Documentation: Add quasibinomomial family in GLM
Commit 3ecf05e: PUBDEV-3482 Per Erin’s request, changed quasi binomial family name to quasibinomial.

In this case, the commit 3ecf05e is linked by the developer to the issue PUBDEV-3482, and the link is improved by both Relink and RCLink. The child issue PUBDEV-3791 contains comments discussing changing the name in source code from “quasi binomial” to “quasibinomial.” In addition, the source code of the commit shows there are only rename changes: “quasi binomial” to “quasibinomial.” Both Relink and RCLink created a link to the more specific issue PUBDEV-3791. However, neither approach resolved the inconsistency for all links.

3.3.3 Summary of Quality Assessment

I assessed four state-of-the-art techniques using two new quality attributes, completeness and consistency, on 10 different projects. I found a lack of completeness and consistency in the linking relationship of the projects examined.
Developers did not link all commits to an issue to create a complete dataset in any of the 10 projects. In addition, in four of the 10 projects, the linking relationship was not consistent. I also found that although state-of-the-art techniques do make improvements to completeness, only Relink and RCLink made improvements to the consistency of projects, and they did not make any of the projects fully consistent.

Chapter 4

Creating a Complete and Consistent Dataset

In this thesis, I have motivated the need for improving software process data consisting of links between commits and issues. To help support the assessment of techniques developed to improve the software process data automatically, it would be useful to have a dataset that has a high degree of completeness and consistency. To test a new technique developed to improve software process data, one could remove links from the dataset and test if they can be replaced.

In this chapter, I describe the shortcomings of the existing datasets used to test proposed techniques to date, and I describe a dataset, consisting of two projects, that I have created to provide a suitable alternative.¹

¹ See www.cs.ubc.ca/labs/spl/projects/issueRelationships/

4.1 Existing Datasets

State-of-the-art techniques to link commits to issues are created and validated using two types of datasets. The first type of dataset is created as a byproduct of developers' work, while the second type is created manually by researchers.

The first type of dataset is created as a byproduct of the development process. Developers follow company or project development practices, manually entering links between issues and commits in issue descriptions and commit messages² (discussed in Chapter 2.2.1). As an example, links for the Connect project are stated by a developer adding an issue id in a fixed format like “CONN-<number>” to a commit message.
This link information is used by researchers to create datasets of linked issues and commits from projects such as Apache, Mozilla, and Eclipse [10, 35, 45, 49, 56, 60]. This type of dataset has two limitations. The first is that the links stated by developers are not validated to be correct or consistent; researchers assume that developers always state perfect links between issues and commits. The second, as demonstrated in Chapter 3.3.2, is that developers do not always link commits to the most specific issue.

To create the second type of dataset, researchers choose a project, go through all commits and issues, and manually identify the links that exist between commits and issues. This technique has been used by Wu and colleagues on two projects: zxing and openintents [56]. This type of dataset also has two limitations. The first is that the projects selected for these datasets have issue repositories with a flat structure, meaning that there are no identified relationships between issues. Relationships between issues are necessary to assess the consistency of a linking relationship. The second is that there is no described process for how commits are linked to issues, which prohibits reproducibility for future datasets.

If state-of-the-art techniques are created using these incomplete and inconsistent datasets, then the heuristics or machine learning algorithms of the techniques may produce an incomplete and inconsistent linking relationship. Further, if the state-of-the-art techniques are validated on an incomplete and inconsistent dataset, the results may not be valid.

4.2 Dataset Creation Process

To create a dataset with more complete and consistent links, I chose two projects that have existing relationships between issues, which is representative of how issue repositories are formed today.
A time window of dates is selected from both projects to create a dataset of feasible size, as manually linking commits to issues is a lengthy process.

² connectopensource.atlassian.net/wiki/display/CONNECTWIKI/How+to+Contribute+Code

Table 4.1: Commit and issue information for each project.

                              Connect   Sonarqube
Commits                       1707      9343
% commits with link           79        62
Issues                        1625      3500
Relationships per issue       2.1       0.8
% issues with relationship    89        50

4.2.1 Projects Selected

I selected two projects, Connect (data is reported from Dec 2012 to Nov 2014 inclusive³) and Sonarqube (data is reported from Apr 2015 to May 2017), because both have an active development process, are programmed in Java, use a Jira⁴ issue repository, and use a VCS. I chose projects whose code is written in Java to remain consistent with the projects used to evaluate the state-of-the-art techniques described in related work. A Jira repository is important to the selection because by default it has issue relationships, as explained in the introduction. A VCS is needed to have a history of commits to link to issues. Only the most recent two years are examined for both projects, as this period best represents the process currently used by developers.

Table 4.1 shows statistics for each project, including the number of issues and commits. Developers in both projects manually linked over 60% of the commits to at least one issue by adding the issue id in the commit message. Connect has nearly three times as many relationships per issue as Sonarqube, with 2.1 for Connect and 0.8 for Sonarqube. In both projects, a large share of issues have at least one relationship: 89% in Connect and 50% in Sonarqube.

To create the dataset, a subset of the data from both projects is used.
To get an appropriate dataset for each project, a multi-day window is selected; for each repository at least 200 commits were selected, a number chosen because it is larger than the datasets used by Wu and colleagues [56]⁵. The selection also needs to be representative of the whole issue repository with respect to the percentage of each type of issue relationship. Table 4.2 shows the window selected compared to the latest two years of each project. A multi-day window of 58 days for Connect was used to get 204 commits and 20 days for Sonarqube to get 202 commits. These were the minimum window sizes necessary to get windows of 200 commits that were representative of the whole issue repository.

³ Data after Nov 2014 is not used for the Connect project, as Connect changed to a private issue repository.
⁴ atlassian.com/software/jira, verified 13 July 2017

Table 4.2: Comparison of relationship types for the window selected compared to the issue repository as a whole for both Connect and Sonarqube.

                 Connect                             Sonarqube
                 Jan 16th -       Dec 2012 -         Apr 30th -       May 2015 -
                 Feb 13th 2013    Nov 2014           May 20th 2015    April 2017
contributes to   -                -                  4 (4%)           5 (0%)
depended         16 (3%)          84 (7%)            22 (23%)         345 (25%)
deprecates       -                -                  0 (0%)           3 (0%)
duplicated       4 (1%)           5 (0%)             6 (6%)           119 (9%)
implements       -                -                  1 (1%)           15 (1%)
related          -                -                  33 (35%)         423 (30%)
replaces         -                -                  1 (1%)           2 (0%)
subtask          142 (31%)        342 (29%)          28 (29%)         437 (32%)
superseded       -                -                  0 (0%)           31 (2%)
supported by     300 (65%)        752 (64%)          -                -

The multi-day window contains a set of commits from the git repository, using the master branch and all branches that have been merged into the master. The set of issues selected are all issues that were created, modified, or resolved within 24 hours before and after the multi-day commit window. 24 hours is chosen because a commit may occur within 24 hours of activity on an issue, according to research by Wu and colleagues [56].
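The issue selection rule just described can be sketched in a few lines of Python. This is only an illustrative sketch: the `Issue` record, its field names, and the function `issues_in_window` are hypothetical and not taken from the thesis tooling, and it assumes that "within 24 hours before and after" means the commit window is widened by 24 hours on each side.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Issue:
    # Hypothetical minimal issue record; real Jira issues carry many more fields.
    key: str
    created: datetime
    resolved: Optional[datetime] = None
    modified: List[datetime] = field(default_factory=list)


def issues_in_window(issues, first_commit, last_commit, slack=timedelta(hours=24)):
    """Return issues created, modified, or resolved inside the commit window,
    widened by `slack` on both sides (assumed reading of the 24-hour rule)."""
    start, end = first_commit - slack, last_commit + slack

    def active(issue):
        stamps = [issue.created, *issue.modified]
        if issue.resolved is not None:
            stamps.append(issue.resolved)
        return any(start <= stamp <= end for stamp in stamps)

    return [issue for issue in issues if active(issue)]
```

Passing the dates of the first and last commit in the window, together with all issues exported from the repository, yields the initial issue set; related issues would still need to be added in a separate step.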
For each issue, its related issues are added to the set of issues, because the related issues may be more specific to a commit with respect to consistency.

⁵ zxing had 143 commits and openintents had 129 commits

4.3 Link Identification

To create the linking relationship for the dataset, I manually identified the links between commits and issues. First, the commits and issues are pre-processed to prepare them for manual examination. The next step is to create the link between the commit and the issue. A set of guidelines is used to assist in identifying links that should be created. Issues were examined according to three filters, which were used to remove issues not likely to be relevant to the commit.

4.3.1 Preprocessing Dataset

The commits in the VCS contain links to issues that were manually stated by developers. To remove any bias that might be introduced by these stated links, a pre-processing step removes all references to git repositories in the text fields of issues, and removes all references to issue ids in commit messages.

References to a commit in the issue repository are either in the form of a URL linking to the code change in Github (e.g., github.com/CONNECT-Solution/CONNECT/pull/1335) or a hash id, which uniquely identifies the commit (i.e., 40 characters consisting of letters and numbers). The references to commits can occur in the text fields of the issues, such as the resolution (where developers put the description of how the issue was resolved), the comments, or the description. To identify and remove all commit references in issues, three patterns were used:

• (?i)https://github.com/<project name>/pull/\d{1,3}
• (?i)https://github.com/<project name>/commit/[a-z0-9]{40}
• (?<=[\s/])[a-z0-9]{40}(?![\d\w])

The first two patterns search for URLs to Github for either a pull request or a commit. These are used by developers to identify what commit addresses the issue.
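The three patterns listed above could be applied as in the following Python sketch. It is an illustration under stated assumptions, not the script used for the thesis: the `PROJECT` value is taken from the example URL in the text, the helper name `redact_commit_refs` is hypothetical, the dots in the host name are escaped (the patterns as printed leave them unescaped), and the case-insensitive flag is passed as `re.IGNORECASE` rather than inline.

```python
import re

# Assumed value for <project name>, taken from the example URL in the text.
PROJECT = "CONNECT-Solution/CONNECT"
REDACTED = "commit reference redacted"

# Order matters: URLs are replaced first so that the bare-hash pattern
# does not match a hash that is still embedded in a commit URL.
COMMIT_REFERENCE_PATTERNS = [
    re.compile(r"https://github\.com/" + re.escape(PROJECT) + r"/pull/\d{1,3}",
               re.IGNORECASE),
    re.compile(r"https://github\.com/" + re.escape(PROJECT) + r"/commit/[a-z0-9]{40}",
               re.IGNORECASE),
    re.compile(r"(?<=[\s/])[a-z0-9]{40}(?![\d\w])"),
]


def redact_commit_refs(text):
    """Replace every commit reference in an issue text field with a fixed marker."""
    for pattern in COMMIT_REFERENCE_PATTERNS:
        text = pattern.sub(REDACTED, text)
    return text
```

The function would be run over each text field of an issue (description, comments, resolution) before the manual linking starts.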
The last pattern identifies hash ids: it looks for a run of 40 letters and digits that is preceded by whitespace or a slash and is not followed by a letter or digit. When any of the patterns is matched, the text is removed and replaced with the text commit reference redacted.

To remove issue ids from commit messages, the same pattern used to identify stated links in Chapter 3.2.3 was used. The pattern used for each project is (?i)<projectname>[^\w]*\d+, where <projectname> is replaced with the name used for the issue id for each project: conn for Connect and sonar for Sonarqube. When the pattern is matched, the text is removed and replaced with the text ISSUE#.

4.3.2 Identification of Links

To create the links for a project, each commit in the selected window for a project is considered in order of date, from the oldest to the most recent. For each commit, the goal is to identify one or more issues that should be linked. A set of guidelines is used to identify whether a commit should be linked to one or more issues. To limit the number of issues examined for each commit, filters are applied to the set of issues, which return a subset of the issues to examine.

The first step I used to identify a link between a commit and an issue is to understand the commit and to understand the issue. To understand the commit, I looked at the commit message, the names of files changed (including the complete path), and all additions or removals in each of the files changed to try to get an understanding of the purpose of the commit. If the commit message is vague (i.e., it does not refer to code changes), I used the previous commit for additional context (if it exists, and is similar in the files altered). The previous commit is the commit made immediately before the commit being inspected.
To understand the issue, I looked at all of the text fields of the issue, including the issue title, description and comments.

Once I gained an understanding of the commit, I began to look for issues to link using the following guidelines:

• The meaning of the code changes in the commit, using either filenames, methods, variables, or features of the code, should reflect the meaning of the issue, based on the text in the issue.
• The author of the commit should be the person assigned to the issue or the reporter of the issue, or should have commented on the issue or made a change to the status of the issue.
• Time should be considered while selecting the issue to be linked: the issue should be resolved within seven days of the commit date for Connect and 12 days for Sonarqube. The time of resolution considered for each project is selected by looking at the average time between commit date and issue resolution of developer stated links.
• If two issues are similar and the commit should link to only one of the issues, then use other commits by the same author to determine to which issue the commit should be linked. Examples of other commits by the same author are commits close to the same date, or commits that changed the same files.

To remove from consideration issues that are not likely relevant to the commit being examined, I used three different filters. Each filter returned a subset of issues based on the rules for the filter. For each filter, I examined every issue returned by the filter to determine if the commit should link to the issue, or to an issue that is related. After every issue, and related issue, returned by a filter is examined, the commit is linked to the issue or issues that are identified as being the most specific link in meaning to the commit. If no issue is identified from the list of issues returned by a filter, then the next filter is used. The filters are used in order:

1.
The first filter orders the set of issues by the cosine similarity [47] of the words in the commit to the words in issues. The words from the commit are selected from the commit message, file names of the changed files, and the added lines of code in each file of a commit. Only code added in the commit is used, to remove redundancy and false positives in cosine matching. The file names and added code are tokenized to extract words from camel casing and underscoring, while removing Java keywords and stopwords. The words in the issues are selected from the description, resolution, title, and comments. The filter returns a list of issues that have a cosine similarity of 0.2 or higher. The threshold of 0.2 was chosen by selecting 50 random commits in the Apache Hive project⁶ and computing the average cosine similarity between each commit and the issue to which the developer linked it. I found that the average cosine similarity was 0.28.

   ⁶ hive.apache.org/, verified 13 July 2017

2. The second filter considers the time between the commit and the issue resolved date, and authorship, and is ordered by resolved date. The filter returns a list of issues where the author of the commit is the person assigned to the issue or the person who reported the issue. The filter selects all issues that are resolved within a fixed time before or after a commit is made. The fixed time for Connect is seven days and for Sonarqube is 12 days.

   Assignee and reporter are chosen as filters as these values are the most likely to represent the developers who make a change in a project. The number of days to filter for each project is selected by looking at the average time between commit date and issue resolution of stated links.

3. The third filter is the same as the previous filter with respect to time, but is different with respect to authorship, and is ordered by resolved date.
The filter returns a list of issues where the author of the commit comments on the issue or makes a change to the status of the issue.

   This filter is to identify developers that may make a commit to address issues to which they are not assigned. For example, a manager may add tests, or another developer may make an edit to update code for the same issue.

4.3.3 Dataset Validity

The links created using this method may be biased by the overall knowledge and experience of the person creating the links, and by that person's lack of experience with the particular projects studied. To minimize this bias and ensure the most specific links are identified, I involved another researcher in the linking process. The first researcher identified links for the first 25 commits using this method. Then, five random commits were selected from the first 25 commits and the other researcher independently applied the same process to link those commits to issues. The five random links were compared to those identified by the first researcher and any conflict was discussed. This process was repeated for the first 100 commits. Thereafter, five out of every 50 commits linked were randomly selected and validated with the other researcher. This same process was used to validate both datasets, Connect and Sonarqube.

Table 4.3: Results of the linking process in identifying links from commits to issues.

            Existing          Created Links
            Completeness %    Completeness %    Improvement %
Connect     69                85                23
Sonar       49                55                13

Cohen's kappa was used to measure agreement between the researchers; it takes into account the agreements that might occur due to chance. The kappa between the two researchers was 0.82 for Connect and 0.67 for Sonarqube. Using two researchers and Cohen's kappa is standard in recent software engineering research [7, 30, 48, 58].
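For readers unfamiliar with the statistic, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each rater's label frequencies. The following Python sketch shows the standard computation; the function name `cohens_kappa` and the example labels are illustrative assumptions, not the data or tooling used in the thesis.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items.

    labels_a[i] and labels_b[i] are the labels (for example, issue keys)
    that rater A and rater B assigned to item i.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1.0 - expected)
```

As a small worked case, if two raters agree on three of four commits and their label frequencies give a chance agreement of 0.5, the kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.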
According to Landis and Koch, agreement between researchers with a kappa value from 0 to 0.2 is considered slight, 0.21 to 0.4 is fair, 0.41 to 0.6 is moderate, 0.61 to 0.8 is substantial, and 0.81 to 1 is considered almost perfect [31].

The filters may have caused a researcher to miss an issue to which a commit should have been linked. To help mitigate this risk, three filters are used to identify different subsets of issues that may be linked to the commit being examined. Two of the filters were also adjusted for each project to account for project-specific processes, but a potential link to an issue may still have been missed with the method used.

4.4 Dataset

In both Connect and Sonarqube, there was an improvement to the completeness of the linking relationship in each dataset. Table 4.3 shows the percentage completeness for links stated by developers compared to the links created by myself and the other researcher. In Connect, 85% of the 204 commits were linked to an issue, and in Sonarqube 55% of the 202 commits were linked to an issue. This improved completeness compared to developers' stated links by 23% in Connect and 13% in Sonarqube.

As an example from the Connect dataset, I found a link between commit cba03ab and issue GATEWAY-3435 that was not stated by the developer.

Issue GATEWAY-3435: use different “soap” namespace for UDDI server
Commit cba03ab: modifed service port builder to use the new api defined in port-descriptorn

Examining the code of commit cba03ab shows that three lines in the Java file CXFServicePortBuilder are changed. The comments in issue GATEWAY-3435 discuss the code changes in CXFServicePortBuilder. In the end, this is a minor change, but the developer did not link the commit to the issue.

There was also a difference in the dataset compared to stated links with respect to consistency, which is used to identify if commits link to the most specific issues, of those issues that have a relationship.
52% of commits linked to an issue with a relationship in Connect and 34% in Sonarqube. Of the commits that were linked to an issue with a relationship, the links that differed from those stated by developers were 48% in Connect and 54% in Sonarqube.

An example of the difference in the Sonarqube project is the link from commit 885033d to issue SONAR-6414.

Issue SONAR-6255: Move tests persistence from batch to compute
Issue SONAR-6414: Tests - Index DB and ES
Commit 885033d: index tests - SONAR-6255

The developer linked the commit to issue SONAR-6255, which is in a “Sub-task” relationship with the child issue SONAR-6414. Examining the code in the commit 885033d shows changes to elastic search (ES) and database (DB) test files. The commit 885033d should be linked to the more specific issue SONAR-6414, rather than to SONAR-6255 as stated by the developer.

4.4.1 Summary

The process of manually creating a dataset took over 212 hours for two researchers to complete. A dataset was created with a focus on making it both complete and consistent by manually linking commits to issues without using developers' stated links. The linking relationship in each dataset had an improvement in completeness of 23% in Connect and 13% in Sonarqube. There was a difference with respect to the consistency of the linking relationship; 48% of the commits linked to an issue with a relationship in Connect and 54% in Sonarqube were different from the developers' stated links. This dataset is available to other researchers who wish to have a more complete and consistent dataset to use for testing linking techniques.⁷

⁷ github.com/whitecat/linked-dataset, verified 13 July 2017

Chapter 5
Semantic Relationships between Issues

A better understanding of how relationships are used in issue repositories and an ability to recognize different uses of relationships provides opportunities to create new, and improve existing, techniques to link commits to issues.
For many software development projects, issue repositories hold key information defining what the system under development will do, who will work on different parts of the system, what defects occur as the system is being built, and more. When defining issues, software developers often expend manual effort to record relationships between issues, capturing such information as how work is to be broken down, how functionality in the system relates and which defects are similar to each other. When I examined the kinds of relationships in these repositories, by asking project developers on the forums they use and by analyzing documentation, I learned that the most frequently occurring relationships describe work breakdowns,¹ causing me to ask “how are software developers using work breakdown relationships in issue repositories?” [51].

¹ In Mylyn, 59% of the relationships are depends-on, which represent work breakdowns. In HBase, sub-tasks are work breakdowns representing 30% of all relationships specified. In Connect, 79% of relationships are work breakdowns via the supported-by and sub-task relationships.

To investigate this question, I performed a qualitative study of a sample of work breakdown relationships from three repositories: Mylyn, Connect and HBase. I had three researchers (authors of the MSR paper [51]) code how these relationships were being used based on an analysis of the titles of selected issues. Through this coding, we determined six codes that describe the kinds of work breakdown relationships, ranging from describing particular cases in which a more general problem must be solved to describing how functionality should be verified.

5.1 Qualitative Study

The qualitative study involved sampling pairs of related issues from the Mylyn, Connect and HBase issue repositories and performing an open coding of the sampled pairs.

5.1.1 Coding Process

I began by selecting pairs of issues related in work breakdown relationships from the Mylyn and Connect repositories.
I chose these two systems to start the open coding process [50] because I had knowledge that they each follow an agile development process and thus might share commonalities in how they use relationships in the issue repository.

Three coders (the first, third and fourth authors of the MSR paper [51]) read the titles of each issue in a selected pair and discussed the meaning of the relationship between the pair. If the meaning had not yet been seen, a code was developed to recognize and describe how the issues are related and was recorded in a codebook.² In the first iteration, 40 issue pairs from Mylyn and 60 issue pairs from Connect were randomly selected and coded.

² See www.cs.ubc.ca/labs/spl/projects/issueRelationships/, verified 13 July 2017

After coding the first 100 issue pairs, I randomly selected a different set of 60 issue pairs from Mylyn and 90 issue pairs from Connect. Each of the three authors involved in the original iteration then coded 2 sets of 20 issue pairs (for a total of 40) from Mylyn and 2 sets of 30 issue pairs (for a total of 60) from Connect. In this way, each set of 20 or 30 issue pairs was coded by two authors. The pairs of coders compared results and tried to reach a consensus on which code applies, updating the codes as necessary. To ensure the updated codes were appropriately applied, all three coders then re-coded all previously coded pairs. At this point no new codes were found.

To determine if the codebook was sufficiently general to cover another system for which it had not been developed, two coders separately coded 50 issue pairs from the HBase repository, which was chosen as having different development characteristics from the other two projects. Based on this coding, some guideline refinements to clarify code selection were made to the codebook. The two coders then coded an additional 30 issue pairs from HBase to check if saturation had been reached.
To assess the inter-coder reliability, I computed Cohen's kappa on the final 30 issue pairs coded; the coders achieved a 0.56 kappa value. As will be explained in the next section, some codes are related through a hierarchy; coders sometimes had differences in the level of code in the hierarchy assigned. If all sub-codes are collapsed to the super-code in the hierarchy, the kappa score rises to [...].

5.1.2 Codes

Six codes were identified from the coding process to more precisely describe the meaning of issues related through a work breakdown relationship: specification refinement, instance of parent, expectation, problem, check validity, and reverse specification. Pairs for which the meaning of the relationship could not be determined were coded as unknown. As Figure 5.1 shows, three of the codes are a specialization of the specification refinement code. Any pair of issues coded was assigned only one code from this set.

Figure 5.1: Codes developed through open coding

I describe the guidelines for each code in turn. For clarity in these guidelines, I refer to the issue that is the source of the relationship as the parent and the issue that is the target of the relationship as the child.

Specification refinement: This code applies when a child issue describes one step of the work breakdown for the parent issue. The following example from Mylyn illustrates this code, as the child issue specifies actions to take towards improving tooltip presentation.

Parent 205861: Improve tooltip presentation and content
Child 238292: Show reporter and beginning of description text on new issue tooltips.

Instance of parent: This code applies when each child issue is a particular case in which the work described by the parent issue should be performed.
The following example from Connect illustrates a child issue that specifies that the work from the parent is to occur on Windows machines; the same work can also be done on other types of machines.

Parent GATEWAY-1664: Create new VMs for final release installation testing. Needed by Friday 3/9
Child GATEWAY-1667: 4 Windows machines

Check validity: This code applies when a child issue describes a verification activity for a parent issue. For instance, in HBase, the child issue describes adding tests to show that the feature described by the parent issue works correctly.

Parent HBASE-10070: HBase read high-availability using timeline-consistent region replicas git
Child HBASE-10791: Add integration test to demonstrate performance improvement

Expectation: This code applies when the child issue describes constraints or suggestions on how a parent issue can be fulfilled. The child issue in these cases often uses words like should, must, need, ensure, and improve. The following example from Mylyn shows how the child issue constrains the parent issue requirements.

Parent 158921: Improve the issue editor usability and information density
Child 212953: Depends on field in issue editor should fill available horizontal space

Problem: This code applies when the child issue describes a problem that occurs in a parent issue. For instance, from Connect, the parent issue describes performing transaction logging and the child issue describes a particular part of the system requiring attention.

Parent GATEWAY-2151: Transaction Logging
Child GATEWAY-2782: non-unique messageid causes transaction not to be logged in transaction repo

Reverse specification: This code applies when a parent issue describes one step of the work breakdown for the child issue. In other words, it is the reverse of specification refinement.
For instance, from Connect, the parent issue is a specific case of the child issue to investigate tests for concurrent messages.

Parent CONN-910: Execute concurrent tests from 3.3 gateway to 4.3 to ensure turning off of replay attacks fixes the issue
Child CONN-859: Investigate and research issue when concurrent messages are sent from connect 3.x gateway to connect 4.2 gateway

Unknown: When none of the six codes just described apply to an issue pair, or when both reverse specification and specification refinement seem applicable, I consider the relationship meaning for a pair to be unknown. For instance, in the following Connect example, the parent issue describes an action to perform but the child issue is a noun phrase.

Parent CONN-1094: Create static screens for Direct configuration in Administrative GUI
Child CONN-1105: Trust Bundles

Table 5.1: Codes occurring in each repository

                      Mylyn          Connect        HBase
Code                  %      #       %      #       %      #
Spec. Refinement      48.0   48      42.0   66      47.5   38
Check Validity        4.0    4       14.0   21      5.0    4
Instance of Parent    5.0    5       21.3   32      7.5    6
Expectation           14.0   14      3.3    5       21.3   17
Problem               22.0   22      6.7    10      11.3   9
Reverse Spec.         3.0    3       2.7    4       0.0    0
Unknown               3.0    3       7.3    11      6.3    5
No consensus          1.0    1       2.7    4       1.3    1
Total                        100            150            80

5.1.3 Threats to Validity

The codes may be biased by the knowledge and experience of the coders. The use of three coders helps minimize this bias. By separating into pairs to code after the development of the initial code book, I helped mitigate the persuasive effect of any one coder.

With each repository coded, the coders clarified the code book. More refinements may be necessary if applied to other repositories, limiting the external validity of the results. Coding only the title of an issue also limits the results. More clarity on how a relationship is used might have been gained by using more information from the issue, such as comments about the work actually undertaken as part of the issue.

5.1.4 Results

Table 5.1 shows the results of coding 330 issue pairs across the three repositories.
The most prevalent code across all repositories was specification refinement. Interestingly, the more specialized version, check validity, occurs much more often in Connect than in the other repositories; perhaps the other repositories do not explicitly record their quality assurance related tasks. Mylyn contains more issue pairs describing problems than the other repositories; this may be due to the high number of issue reporters who are not contributing developers. The expectation code occurs more frequently in HBase, perhaps because developers perform more analysis of work breakdowns before specifying child tasks. The instance of parent code occurs more frequently in Connect, suggesting that the developers more frequently refer to structural parts of the system when specifying work breakdown issues.

5.2 Summary

Developers often expend manual effort to specify how issues in an issue repository relate, especially to express how work is to be broken down and performed on the system. To investigate what kinds of work breakdowns are being expressed, I performed, in conjunction with three colleagues, an open coding of a sample of 330 related issue pairs from the issue repositories of three open source systems: Mylyn, Connect and HBase. The open coding process resulted in six codes that describe a variety of kinds of work breakdowns, including cases where the work breakdowns express steps of verification and express constraints on work to be performed.

This study is the first to provide insight into the richness of information embedded in relationships in issue repositories.
This information offers new opportunities to link commits to issues.

Chapter 6
Discussion and Future Work

An understanding of how issue relationships are used may help inform the development of new tools and may help improve the linking of commit and issue data.

6.1 Improving Links of Commits to Issues

Despite the rich amount of relationship information contained in an issue repository, none of the existing techniques use that information to create or refine links. State-of-the-art techniques to improve this link information might use relationships to further improve link information.

For example, the technique introduced by Schermann and colleagues uses heuristics that reduce the number of unlinked issues in two ways [45]. The first heuristic, “Phantom”, creates links between issues and commits, finding multiple commits that belong to a single issue using existing links. The second heuristic, “Loaner”, links a single commit with no issue id to an issue with no existing link. Figure 1.1 provides an example of how relationship information might improve these heuristics. The work by Schermann and colleagues describes that the “phantom” commit 14cd8d6b3 should be linked to the issue CAMEL-7354 [45] (see Figure 1.1a). However, CAMEL-7354¹ has seven subtasks, and an inspection of the subtasks shows that commit 14cd8d6b3 should be linked to one of the sub-tasks, CAMEL-7675². Introducing relationship information, such as relationship category, could improve linking techniques by refining the existing links, moving commits to the correct level of work breakdown (see Figure 5.1).

¹ issues.apache.org/jira/browse/CAMEL-7354, verified 13 July 2017
² issues.apache.org/jira/browse/CAMEL-7675, verified 13 July 2017

6.2 Techniques to Categorize Issue Relationships

An indication of the semantic relationship between two issues may help further improve techniques for linking commits to issues.
The assignment of semantic codes to relationships in this thesis is referred to as “categorizing”, building on the manual work described in Chapter 5.1. As developers do not benefit directly from categorizing relationships, future work may consider automatic categorization of work breakdown relationships.

6.2.1 Algorithm

The issues in issue repositories are entered largely manually by developers to describe work to be performed and problems encountered with the system. As a result of this manual entry, the natural language used to describe the issues varies extensively. Two steps may help provide an automated approach to categorizing issue relationships. First, the language found in issues could be transformed to a more restricted set to identify the similarities in how issues relate. Second, the work breakdown issue pairs in an issue repository could be automatically categorized using the similarities identified and supplementary materials, such as information about the structure of the system.

Algorithm for Transformation

Nasukawa and colleague found that transforming large amounts of textual data into a smaller dataset assisted in knowledge discovery [34]. Such an approach may help in categorizing relationships. The language in the issue repository could be transformed using a restricted language created with verbs, code terms, and noun phrases found in the issue repository, in combination with other sources such as web sites and WordNet corpora. Then, the restricted language could be used to transform the title of each issue. The output of the transformation could be one
The output of the transformation could be oneor more fields of issues in the issue repository augmented with data containing a49Table 6.1: Example of transformation using a restricted language.Input OutputMylyn [api] extract core utility classes from org.eclipse.monitor.core [api] extract *concept* from *code*[user] See relative path of the file from the compare editor tooltip [user] See *concept* from *concept*Be able to cancel the treading process when generating a report Be able to cancel the *concept* when *concept*ConnectCreate testing artifacts for Admin GUI Add testing artifacts for *concept*Increase Code Coverage for Callback package Increase Code Coverage for *code* packageInstall and test GF 3.1.2 on windows Install and test *version* on *system*Base Add lint for hbase-annotations being used. Add *concept* for *concept* being used.make sure HBase APIs are compatible for 0.94 make sure *system* are compatible for *version*Persist memstoreTS to disk Persist *code* to *concept*restricted language.A concrete example of transformation by restricting the language in titles isshown in Table 6.1. In this example, the titles use the set of restriction rules, re-placing words with higher level structure which groups terms using different algo-rithms.Transforming language, by restricting the set of words found in issue text, canassist a categorizer in identifying how issues relate, though increasing similarities.A group of transformation algorithms could be necessary which focus on reducingthe variety of text across an issue repository. The input to these algorithms canbe word graphs of title fields generated by NLP, which includes stemming and thegrammatical relationships between words [14]. 
The algorithms were envisioned after creating the semantic codes in Chapter 5, which were chosen by looking at what changed between two titles and what action is taking place.

The rationale for each algorithm is described in turn:

Code elements: This algorithm finds text that refers to elements of code, as developers may refer to specific code elements when doing work. This can be implemented using work by Bacchelli and colleagues, who found that camel casing and non-English dictionary words were effective indicators of code elements in text [3]. The output of the algorithm replaces code elements with the word *code*.

Concept: The goal of this algorithm is to replace project-specific program concepts with the word *concept*. The algorithm can be implemented with the assistance of concept location techniques. Multiple techniques to implement concept location exist, such as latent semantic indexing, formal concept analysis, program dependence analysis, and term frequency. This algorithm can be implemented using work by Poshyvanyk and colleagues [37], which uses latent semantic indexing on a source code corpus with a query. The query for this algorithm is a noun phrase, and the source code comes from the version control system related to the issue repository.

Version: This algorithm can find noun phrases referring to different versions of software in an issue repository; for example, “Connect 3.1” specifies the 3.1 version of Connect. Using pattern matching, version numbers can be found in a noun phrase. The output of this algorithm replaces such noun phrases with the word *version*.

External system: This algorithm finds references to external libraries or systems in the titles of issues. Using Stack Overflow tags (www.stackoverflow.com, verified 13 July 2017), a list of libraries and systems can be created. Nouns in text matching a tag would be replaced with the word *system*.

Verb grouping: This algorithm minimizes the number of different verbs in text. To do this, all verbs in an issue repository are clustered using the DBSCAN clustering algorithm [20], with similarity between verbs measured using a word similarity database. Each cluster can be labeled with its most frequently occurring verb, and verbs are replaced with their cluster label. Multiple word similarity databases exist, such as Freebase [12], VerbNet [46], and WordSimSEDB [52]. The WordSimSEDB database is used for this algorithm as it is tailored to software engineering words.

Algorithm for Categorization

Using the restricted language, a categorizer algorithm could be created, based on supervised machine learning. The algorithm would consider issue pairs that are gathered the same way as in Chapter 5. A categorizer could be trained using information in each issue pair, such as restricted language and authorship. The categorizer could then be used to categorize all work breakdown issue pairs in an issue repository.

For instance, an algorithm might learn that a pair in which one issue has a title containing the word test is always categorized as check validity. In Figure 6.1, the categorizer takes the issue pair [GATEWAY-1911, GATEWAY-1996], applies the algorithm, and labels the pair as check validity.

Uncategorized (issue pair):
  [{ GATEWAY-1911: *component* provide *version*
     GATEWAY-1996: test for GATEWAY-1911 }]
Categorized (issue pair):
  [{ GATEWAY-1911, GATEWAY-1996, check validity }]

Figure 6.1: Example of categorizing with a supervised machine learning algorithm that learned to label issue pairs as check validity if the child issue contains the word test.

Supervised machine learning can be used if categorization of issue relationships is framed as a text categorization problem, which assigns a category to each document in a set from a list of possible categories [25].
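To make the text categorization framing concrete, the following Python sketch trains a small classifier on issue pairs and applies it to the pair from Figure 6.1. Several things here are assumptions made only for illustration: a multinomial Naive Bayes classifier with add-one smoothing is used because it is short to write, the four training pairs are invented, and the label `other` is a hypothetical placeholder for the remaining semantic categories. Only the label `check validity` and the Figure 6.1 pair come from the text.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesCategorizer:
    """Multinomial Naive Bayes over bag-of-words documents, add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocabulary = set()

    def update(self, document, label):
        """Add one labeled document; can be called whenever a new pair arrives."""
        words = document.split()
        self.label_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def categorize(self, document):
        """Return the category with the highest log posterior for the document."""
        words = document.split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def pair_to_document(parent_title, child_title):
    """Treat an issue pair as one document, marking which issue each word came from."""
    parent = ["parent_" + word for word in parent_title.lower().split()]
    child = ["child_" + word for word in child_title.lower().split()]
    return " ".join(parent + child)


# Invented training pairs in the restricted language (assumption).
training = [
    (("*concept* provide *version*", "test for *concept*"), "check validity"),
    (("add *concept* for *code*", "test *code* on *system*"), "check validity"),
    (("add *concept* for *code*", "add *concept* to *code*"), "other"),
    (("increase *concept* for *code*", "persist *code* to *concept*"), "other"),
]

model = NaiveBayesCategorizer()
for (parent_title, child_title), label in training:
    model.update(pair_to_document(parent_title, child_title), label)

figure_pair = pair_to_document("*component* provide *version*", "test for GATEWAY-1911")
print(model.categorize(figure_pair))
```

Prefixing each word with `parent_` or `child_` is one simple way to let the classifier distinguish a word such as test in the child issue from the same word in the parent issue.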
For this approach, an issue pair is considered a document and the list of categories comes from the semantic categories discussed in Chapter 5.

Features for Machine Learning

Machine learning algorithms need features as input. The features come from two areas of an issue: ownership and the transformed title. The transformed title is used because there is more regularity between titles, and the increased regularity potentially increases the patterns found using machine learning. Ownership is another important feature to include because work by Hindle and colleagues [23] finds ownership to be an important feature in categorizing large commits. Features are extracted individually from both issues in an issue pair. The features are as follows:

• The transformed title is a brief explanation of work to be completed for an issue, using the restricted language. The feature is a vector of the frequency of terms in the text. The vector is normalized using inter-document frequency, intra-document frequency, and document length [41].
• The verb group in issue text is the action used to describe work needed to complete an issue. This feature is the verb group of the first verb in the issue title.
• Code elements in issue text are references to where changes are, or might be, made to complete an issue. This feature is the existence of code elements in the issue title.
• Whether a sentence starts with a noun, a verb, or neither is an indicator of the type of work to be performed. The feature is whether a text field in an issue starts with a noun, a verb, or neither.
• Ownership of an issue is the person who is assigned to complete it. The feature is which person is assigned to complete the issue.

Each of these features could be extracted from both issues in an issue pair and labeled according to whether it relates to the start or end node of the issue pair.
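The feature list above can be sketched in Python as follows. This is an illustrative reading under several assumptions rather than a definitive design: the term weighting uses a damped term frequency, an inverse document frequency, and L2 length normalization as one plausible interpretation of the normalization described; `VERB_GROUPS` is an invented mapping standing in for the output of verb grouping; the first verb is approximated as the first title token found in that mapping; the issue dictionaries and assignee names are hypothetical; and the noun/verb/neither feature is omitted because it needs a part-of-speech tagger.

```python
import math
from collections import Counter

# Invented verb-to-group mapping standing in for verb clustering output (assumption).
VERB_GROUPS = {"create": "add", "add": "add", "test": "test", "check": "test"}


def title_vectors(titles):
    """Weight terms by damped frequency and inverse document frequency, then L2-normalize."""
    tokenized = [title.lower().split() for title in titles]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        weights = {
            term: math.log(1 + count) * math.log(len(titles) / doc_freq[term])
            for term, count in Counter(tokens).items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def issue_features(issue, title_vector, role):
    """Build the features of one issue, prefixing each name with its role in the pair."""
    tokens = issue["title"].lower().split()
    verb_group = next((VERB_GROUPS[t] for t in tokens if t in VERB_GROUPS), "none")
    features = {f"{role}_term_{term}": w for term, w in title_vector.items()}
    features[f"{role}_verb_group"] = verb_group
    features[f"{role}_has_code"] = "*code*" in tokens
    features[f"{role}_owner"] = issue["assignee"]
    return features


def pair_features(start_issue, end_issue, corpus_titles):
    """Combine the features of the start and end issues of a pair into one dict."""
    titles = corpus_titles + [start_issue["title"], end_issue["title"]]
    vectors = title_vectors(titles)
    features = issue_features(start_issue, vectors[-2], "start")
    features.update(issue_features(end_issue, vectors[-1], "end"))
    return features


# Hypothetical issues with already transformed titles.
start = {"title": "add *concept* for *code*", "assignee": "alice"}
end = {"title": "test *code* on *system*", "assignee": "bob"}
features = pair_features(start, end, ["persist *code* to *concept*"])
print(features["start_verb_group"], features["end_owner"])
```

Keeping the `start_` and `end_` prefixes in the feature names lets the same kind of feature from the two issues be treated as distinct inputs.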
These features could then be used as input to the categorizer.

Potential Machine Learning Algorithms

Future work could investigate two supervised machine learning algorithms: Naive Bayes and Support Vector Machine (SVM). The Naive Bayes algorithm is of interest because it can be implemented as an online algorithm, which can simulate being updated as issues are entered into an issue repository. The support vector machine is an obvious algorithm to investigate because it is well suited to text categorization and, combined with active learning, can reduce the need for labeled training data [54]. Recent work by Rahoman and colleagues supports using SVM as a machine learning technique to categorize text [38].

Chapter 7

Conclusion

The goal of this thesis is to improve the quality of software process data in the form of links between commits and issues. There is no standard approach to assess the quality of the links between commits and issues. I introduced the quality attributes of consistency and completeness as a means to assess the quality of the existing links between commits and issues. Completeness addresses whether all commits link to an issue in the issue repository. Consistency measures whether all commits link to the most relevant issue in the issue repository. I found that existing techniques that aim to improve links between issues and commits produced results that do not fully address completeness and consistency.

To help future researchers investigate new techniques for improving link data between commits and issues, I created a dataset consisting of a portion of two open source projects. I found the manually created dataset had an improvement of 23% and 13% in Connect and Sonarqube respectively with respect to completeness when compared to the data entered in the original systems by developers. With respect to consistency, the dataset had many commits whose linked issue changed: 48% in Connect and 54% in Sonarqube were different from the developer’s stated links.
Future researchers can use this dataset to create a technique to link commits to issues.

Relationships that exist between issues have not been considered in research to improve software process data. Issue relationships hold information about how work is broken down in a system or how the functionality in a system is related. One problem with the relationships between issues is the ambiguity among the different meanings of relationship types, which may lead developers to mislabel a relationship. I explored the different kinds of relationship types in issues in three repositories, Mylyn, Connect, and HBase, and found that work breakdown relationships were the most common in all three issue repositories. I then conducted a study to identify the underlying meanings of issues that have work breakdown relationships. The study found six primary underlying meanings for work breakdown relationships.

Future researchers can build on these results to build and test enhanced approaches to improve link data between commits and issues.

Bibliography

[1] J. Anvik, L. Hiew, and G. C. Murphy. Coping with an open bug repository. In OOPSLA Workshop on Eclipse Technology eXchange, eclipse, pages 35–39, 2005.
[2] J. Anvik, L. Hiew, and G. C. Murphy. Who should fix this bug? In International Conference on Software Engineering, ICSE, pages 361–370, 2006.
[3] A. Bacchelli, M. D’Ambros, M. Lanza, and R. Robbes. Benchmarking lightweight techniques to link e-mails and source code. In Working Conference on Reverse Engineering, WCRE, pages 205–214, Oct 2009.
[4] A. Bachmann, C. Bird, F. Rahman, P. Devanbu, and A. Bernstein. The missing links: bugs and bug-fix commits. In International Symposium on Foundations of Software Engineering, FSE, pages 97–106, 2010.
[5] S. Banerjee, J. Helmick, Z. Syed, and B. Cukic. Eclipse vs. Mozilla: A comparison of two large-scale open source problem report repositories. In International Symposium on High Assurance Systems Engineering, HASE, pages 263–270, 2015.
[6] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann. What makes a good bug report? In Joint Meeting of the European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE, pages 308–318, 2008.
[7] S. Beyer and M. Pinzger. Grouping Android tag synonyms on Stack Overflow. In International Workshop on Mining Software Repositories, MSR, pages 430–440, 2016.
[8] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov, and P. Devanbu. Fair and balanced?: bias in bug-fix datasets. In Joint Meeting of the European Software Engineering Conference and International Symposium on Foundations of Software Engineering, ESEC/FSE, pages 121–130, 2009.
[9] C. Bird, N. Nagappan, H. Gall, B. Murphy, and P. Devanbu. Putting it all together: Using socio-technical networks to predict failures. In International Symposium on Software Reliability Engineering, ISSRE, pages 109–119, 2009.
[10] T. F. Bissyande, F. Thung, S. Wang, D. Lo, L. Jiang, and L. Reveillere. Empirical evaluation of bug linking. In European Conference on Software Maintenance and Reengineering, CSMR, pages 89–98, 2013.
[11] B. W. Boehm. A spiral model of software development and enhancement. Computer, 21(5):61–72, 1988.
[12] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In International Conference on Management of Data, COMAD, pages 1247–1250, 2008.
[13] C. Casalnuovo, P. Devanbu, A. Oliveira, V. Filkov, and B. Ray. Assert use in GitHub projects. In International Conference on Software Engineering, ICSE, pages 755–766. IEEE, 2015.
[14] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 740–750, 2014.
[15] M. Choetkiertikul, H. K. Dam, T. Tran, and A. Ghose. Predicting delays in software projects using networked classification. In International Conference on Automated Software Engineering, ASE, pages 353–364, 2015.
[16] J. Cook and A. Wolf. Automating process discovery through event-data analysis. In International Conference on Software Engineering, ICSE, pages 73–82, 1995.
[17] D. Čubranić. Automatic bug triage using text categorization. In International Conference on Software Engineering and Knowledge Engineering, SEKE, pages 92–97, 2004.
[18] D. Cubranic and G. Murphy. Hipikat. In International Conference on Software Engineering, ICSE, pages 408–418, 2003.
[19] B. Duan and B. Shen. Software process discovery using link analysis. In International Conference on Communication Software and Networks, ICCSN, pages 60–63, 2011.
[20] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In International Conference on Knowledge Discovery and Data Mining, KDD, pages 226–231. AAAI, 1996.
[21] M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. In International Conference on Software Maintenance, pages 23–32, 2003.
[22] A. Hindle, M. W. Godfrey, and R. C. Holt. Software process recovery using recovered unified process views. In International Conference on Software Maintenance, ICSM, pages 1–10, 2010.
[23] A. Hindle, D. German, and R. Holt. What do large commits tell us? In International Workshop on Mining Software Repositories, MSR, pages 99–108, May 10, 2008.
[24] M. Jankovic and M. Bajec. Comparison of software repositories for their usability in software process reconstruction. In International Conference on Research Challenges in Information Science, RCIS, pages 298–308, 2015.
[25] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, ECML, pages 137–142, 1998.
[26] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian. The promises and perils of mining GitHub. In Working Conference on Mining Software Repositories, MSR, pages 92–101. ACM, 2014.
[27] S. Kim, T. Zimmermann, E. J. Whitehead Jr, and A. Zeller. Predicting faults from cached history. In International Conference on Software Engineering, pages 489–498, 2007.
[28] E. Kindler, V. Rubin, and W. Schäfer. Activity mining for discovering software process models. Software Engineering, 79:175–180, 2006.
[29] A. J. Ko, B. A. Myers, and D. H. Chau. A linguistic analysis of how people describe software problems. In Visual Languages and Human-Centric Computing, VL/HCC, pages 127–134, 2006.
[30] O. Kononenko, O. Baysal, and M. Godfrey. Code review quality. In International Conference on Software Engineering, ICSE, pages 1028–1038, 2016.
[31] J. R. Landis and G. G. Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.
[32] T. D. B. Le, M. Linares-Vasquez, D. Lo, and D. Poshyvanyk. RCLinker: Automated linking of issue reports and commits leveraging rich contextual information. In International Conference on Program Comprehension, ICPC, pages 36–47, May 2015.
[33] A. Mockus, R. T. Fielding, and J. D. Herbsleb. Two case studies of open source software development: Apache and Mozilla. Transactions on Software Engineering and Methodology, 11(3):309–346, July 2002.
[34] T. Nasukawa and T. Nagano. Text analysis and knowledge mining system. IBM Systems Journal, 40(4):967–984, 2001.
[35] A. Nguyen, T. Nguyen, H. Nguyen, and T. Nguyen. Multi-layered approach for recovering links between bug reports and fixes. In International Symposium on Foundations of Software Engineering, FSE, pages 1–11, 2012.
[36] W. Poncin, A. Serebrenik, and M. van den Brand. Process mining software repositories. In European Conference on Software Maintenance and Reengineering, CSMR, pages 5–14, 2011.
[37] D. Poshyvanyk and A. Marcus. Combining formal concept analysis with information retrieval for concept location in source code. In International Conference on Program Comprehension, ICPC, pages 37–48, 2007.
[38] M.-M. Rahoman, T. Nasukawa, H. Kanayama, and R. Ichise. LiCord: Language independent content word finder. In International Conference on Hybrid Artificial Intelligence Systems, HAIS, pages 40–52, 2016.
[39] S. Rastkar and G. C. Murphy. Why did this code change? In International Conference on Software Engineering, ICSE, pages 1193–1196, 2013.
[40] B. Ray, D. Posnett, V. Filkov, and P. Devanbu. A large scale study of programming languages and code quality in GitHub. In International Symposium on Foundations of Software Engineering, FSE, pages 155–165. ACM, 2014.
[41] J. D. Rennie, L. Shih, J. Teevan, and D. R. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In International Conference on Machine Learning, ICML, pages 616–623, 2003.
[42] H. Rocha, G. Oliveira, H. Marques-Neto, and M. Valente. NextBug: a Bugzilla extension for recommending similar bugs. Journal of Software Engineering Research and Development, 3(1):3, 2015.
[43] P. Runeson, M. Alexandersson, and O. Nyholm. Detection of duplicate defect reports using natural language processing. In International Conference on Software Engineering, ICSE, pages 499–510, 2007.
[44] R. J. Sandusky, L. Gasser, and G. Ripoche. Bug report networks: Varieties, strategies, and impacts in a F/OSS development community. In Workshop on Mining Software Repositories, MSR, pages 80–84, 2004.
[45] G. Schermann, M. Brandtner, S. Panichella, P. Leitner, and H. Gall. Discovering loners and phantoms in commit and issue data. In International Conference on Program Comprehension, ICPC, 2015.
[46] K. K. Schuler. VerbNet: A broad-coverage, comprehensive verb lexicon. Ph.D. thesis, Univ. of Pennsylvania, 2005.
[47] A. Singhal. Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin, 24(4):35–43, 2001.
[48] R. Slavin, X. Wang, M. Hosseini, J. Hester, R. Krishnan, J. Bhatia, T. Breaux, and J. Niu. Toward a framework for detecting privacy policy violations in Android application code. In International Conference on Software Engineering, ICSE, pages 25–36, 2016.
[49] J. Sliwerski, T. Zimmermann, and A. Zeller. When do changes induce fixes? In Workshop on Mining Software Repositories, number 4 in MSR, pages 1–5, 2005.
[50] A. Strauss and J. M. Corbin. Basics of qualitative research: Grounded theory procedures and techniques. Sage Publications, Inc, 1990.
[51] C. A. Thompson, G. C. Murphy, M. Palyart, and M. Gašparič. How software developers use work breakdown relationships in issue repositories. In International Conference on Mining Software Repositories, MSR, pages 281–285, 2016.
[52] Y. Tian, D. Lo, and J. Lawall. Automated construction of a software-specific word similarity database. In Software Maintenance, Reengineering and Reverse Engineering, CSMR-WCRE, pages 44–53, 2014.
[53] W. F. Tichy. RCS - a system for version control. Software: Practice and Experience, 15(7):637–654, 1985.
[54] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research, 2:45–66, 2002.
[55] X. Wang, L. Zhang, T. Xie, J. Anvik, and J. Sun. An approach to detecting duplicate bug reports using natural language and execution information. In International Conference on Software Engineering, ICSE, pages 461–470, 2008.
[56] R. Wu, H. Zhang, S. Kim, and S.-C. Cheung. ReLink: recovering links between bugs and changes. In European Conference on Foundations of Software Engineering, FSE, pages 15–25, 2011.
[57] X. Xia, D. Lo, E. Shihab, X. Wang, and B. Zhou. Automatic, high accuracy prediction of reopened bugs. In International Conference on Automated Software Engineering, number 1 in ASE, pages 75–109, 2015.
[58] A. Zagalsky, C. G. Teshima, D. M. German, M.-A. Storey, and G. Poo-Caamaño. How the R community creates and curates knowledge: a comparative study of Stack Overflow and mailing lists. In International Workshop on Mining Software Repositories, MSR, pages 441–451, 2016.
[59] Y. Zhang, D. Lo, P. S. Kochhar, X. Xia, Q. Li, and J. Sun. Detecting similar repositories on GitHub. In International Conference on Software Analysis, Evolution and Reengineering, SANER, pages 13–23, 2017.
[60] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy. Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In Joint Meeting of the European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE, pages 91–100, 2009.

Appendix A

Links examined for consistency

Table A.1: Links examined for consistency evaluation in Chapter 3.
No changes to the links were made by Phantom or Loner.

Project  hash id  Existing Link  Relink  RCLink  Most specific Link

CB
  dcada79  CB-11921  CB-11917,CB-9656
Flink
  67ca4a4  FLINK-3200  FLINK-3201
  77eb4f0  FLINK-3207
  ceb6424  FLINK-3435
  d353895  FLINK-2314,FLINK-3717,FLINK-3889  FLINK-3889,FLINK-3808,FLINK-3717
  dec0d6b  FLINK-4410  FLINK-4697
h2o 3
  025b3e1  PUBDEV-2922
  041bf21  PUBDEV-3664  PUBDEV-3695,PUBDEV-3667
  162c909  PUBDEV-1843  PUBDEV-2746
  3ecf05e  PUBDEV-3482  PUBDEV-3791  PUBDEV-3482,PUBDEV-3695,PUBDEV-3709,PUBDEV-3791  PUBDEV-3791
  4aa59cb  PUBDEV-2729
maven
  1cb2e92  MNG-6140
  94bc4de  MNG-6093
  b80915b  MNG-3507
nutch
  1aa67f7  NUTCH-2221  NUTCH-2144
  5784b64  NUTCH-961
  f2f2ed6  NUTCH-1233
pentaho
  ee561dc  PDI-14132  PDI-14800  PDI-14855
  410d77c  PDI-14132  PDI-14855
sonar
  2715a07  SONAR-7634
  2e2d8c7  SONAR-8221  SONAR-8241
  3a56577  SONAR-8028
  3f9038d  SONAR-7970,SONAR-7986  SONAR-7973
  c06807d  SONAR-8120  SONAR-8166
spring framework
  ec1eb14  SPR-14680
  99cacaa  SPR-14680,SPR-14865  SPR-14865
  770f0c0  SPR-13495
  bc14c5b  SPR-14521
  59c88eb  SPR-13475  SPR-13486,SPR-13973

