
UBC Theses and Dissertations


A Study of Bugs in Test Code and a Test Model for Analyzing Tests. Vahabzadeh Sefiddarbon, Arash. 2016.


Full Text

A Study of Bugs in Test Code and a Test Model for Analyzing Tests

by

Arash Vahabzadeh Sefiddarbon

B.Sc., Sharif University of Technology, Iran, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Applied Science

in

THE FACULTY OF APPLIED SCIENCE (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

October 2016

© Arash Vahabzadeh Sefiddarbon, 2016

Abstract

Testing has become a widespread practice among practitioners. Test cases are written to verify that production code functions as expected and are modified alongside the production code. Over time, the quality of the test code can degrade. The test code might contain bugs, or it can accumulate redundant test cases or very similar ones with many redundant parts. The work presented in this dissertation has focused on addressing these issues by characterizing bugs in test code, and proposing a test model to analyze test cases and support test reorganization. To characterize the prevalence and root causes of bugs in test code, we mine the bug repositories and version control systems of 448 Apache Software Foundation projects. Our results show that around half of all the projects had bugs in their test code; the majority of test bugs are false alarms, i.e., the test fails while the production code is correct, while a minority of these bugs result in silent horrors, i.e., the test passes while the production code is incorrect; missing and incorrect assertions are the dominant root cause of silent horror bugs; semantic, flaky, and environment-related bugs are the dominant root cause categories of false alarms. We present a test model for analyzing tests and performing test reorganization tasks in test code. Redundancies increase the maintenance overhead of the test suite and increase the test execution time without increasing the test suite's coverage and effectiveness.
We propose a technique that uses our test model to reorganize test cases in a way that reduces the redundancy in the test suite. We implement our approach in a tool and evaluate it on four open-source software systems. Our empirical evaluation shows that our approach can reduce the number of redundant test cases by up to 85% and the test execution time by up to 2.5% while preserving the test suite's behaviour.

Preface

The work presented in this thesis was conducted by the author, Arash Vahabzadeh, under the supervision of Professor Ali Mesbah. The second chapter of this thesis was also in collaboration with Amin Milani Fard. I was responsible for devising the approach and the experiments, implementing the tools, running the experiments, evaluating and analyzing the results, and writing the manuscript. My collaborators guided me with the creation of the methodology and the analysis of results, as well as editing and writing portions of the manuscript. Parts of the results described in chapter 2 of this thesis were published as a conference paper in September 2015 in the Proceedings of the 31st International Conference on Software Maintenance and Evolution (ICSME) [52]. The results described in chapter 3 of this thesis are submitted to an IEEE software testing conference and are currently under review.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Research Questions
  1.2 Contributions
  1.3 Thesis Organization
2 An Empirical Study of Bugs in Test Code
  2.1 Methodology
    2.1.1 Data Collection
    2.1.2 Test Bug Categorization
    2.1.3 Test Bug Treatment Analysis
    2.1.4 FindBugs Study
  2.2 Prevalence of Test Bugs
  2.3 Categories of Test Bugs
    2.3.1 Silent Horror Test Bugs
    2.3.2 False Alarm Test Bugs
  2.4 Automatic Test Bug Classification
  2.5 Test Bugs vs Production Bugs
  2.6 FindBugs Study
    2.6.1 Detected Bugs
    2.6.2 Categories of Test Bugs Detected by FindBugs
    2.6.3 FindBugs' Effectiveness in Detecting Test Bugs
  2.7 Discussion
  2.8 Conclusions
3 Reducing Fine-Grained Test Redundancies
  3.1 Approach
    3.1.1 The Test Suite Model
    3.1.2 Inferring the Model
  3.2 Evaluation
    3.2.1 Subject Systems
    3.2.2 Procedure and Results
  3.3 Discussion
    3.3.1 Applications
    3.3.2 Relation to Test Suite Reduction Techniques
    3.3.3 Limitations
    3.3.4 Threats to Validity
  3.4 Conclusions
4 Related Work
5 Conclusion and Future Work
  5.1 Future Work
Bibliography

List of Tables

Table 2.1 Top 10 ASF projects sorted by the number of reported test bugs.
Table 2.2 Descriptive statistics of test bug reports.
Table 2.3 Test bug categories for false alarms.
Table 2.4 Test code warnings detected by FindBugs.
Table 2.5 Performance of classifier for false alarm categories.
Table 2.6 Comparison of test and production bug reports.
Table 2.7 Descriptive statistics of bugs reported by FindBugs.
Table 3.1 Subject systems and their characteristics.
Table 3.2 Partial redundancy in the test suites.
Table 3.3 Test reduction.
Table 3.4 TESTMODLER's effectiveness in terms of execution time.
Table 3.5 Test suite's coverage before and after reorganization.
Table 3.6 TESTMODLER's performance.

List of Figures

Figure 1.1 Different scenarios for fixing test and production bugs.
Figure 2.1 Overview of the data collection phase.
Figure 2.2 Bug reports collected from bug repositories and version control systems. |(A∪B)−C| = 5,556 test bug reports in total (|B−A−C| = 3,849, |A−C| = 1,707).
Figure 2.3 Distribution of silent horror bug categories.
Figure 2.4 An example of a silent horror test bug due to a fault in a for loop.
Figure 2.5 An example of a silent horror test bug due to a missing assertion.
Figure 2.6 Distribution of false alarms.
Figure 2.7 Percentage of subcategories of test bugs.
Figure 2.8 Test bug distribution based on the testing phase in which bugs occurred.
Figure 2.9 Resource handling bug pattern in test code.
Figure 2.10 Distribution of warnings reported by FindBugs.
Figure 3.1 Test cases from the Apache Commons project.
Figure 3.2 Test suite and production code interaction.
Figure 3.3 Extracted partial model for the running example. Ovals illustrate test states and rectangles illustrate labels of test statement edges. Dotted lines represent the state compatibility relations between test states and test statements.
Figure 3.4 Overview of our approach.
Figure 3.5 Test suite after reorganization.
Figure 3.6 Clustering test cases for reorganizing.
Figure 3.7 Distribution of redundant statements and test cases in the test suite.
Figure 3.8 Comparison of optimal and actual reductions for the number of test cases and test statements.

Acknowledgments

I would like to thank my supervisor, Dr. Ali Mesbah, for his unwavering support, careful supervision, and invaluable guidance throughout the course of this research. Without his critical reviews and intellectual inputs, the completion of this thesis would not have been possible for me. I am also sincerely thankful to Dr.
Ivan Beschastnikh and Dr. Sathish Gopalakrishnan for accepting to be a part of my defence committee.

I would also like to thank my friends and colleagues at the Software Testing and Analysis Lab who helped me with their support, encouragement, and feedback. Last but certainly not least, I would like to thank my family, especially my mother, for their love and endless support.

Chapter 1

Introduction

Software testing is an essential part of software development. As software systems have become more complex in the last decades, developers rely more on software testing to ensure the quality of these systems. Developers write test cases to verify the functionality of the software under test, detect bugs earlier in the software development process [58], and increase the confidence and speed of software development activities [14]. Test cases are also used as regression tests to make sure previously working functionality still works when the software evolves. The test code of a software system needs to evolve alongside its production code [63]. New test cases need to be added or existing test cases need to be modified to cover new functionalities and bug fixes. Over time, the quality of the test code can degrade. The test code might contain bugs, or it can accumulate redundant test cases or very similar ones with many redundant parts. The work presented in this thesis has focused on addressing these issues by characterizing bugs in test code, and proposing a test model to analyze test cases and support test reorganization to reduce redundancies in the test code.

Since test cases are code written by developers, they may contain bugs themselves. In fact, it is stated [44] and believed by many software practitioners [18, 36, 56] that "test cases are often as likely or more likely to contain errors than the code being tested". Buggy tests can be divided into two broad categories [18]. First, a fault in test code may cause the test to miss a bug in the production code (silent horrors).
These bugs in the test code can cost at least as much as bugs in the production code, since a buggy test case may miss (regression) bugs in the production code. These test bugs are difficult to detect and may remain unnoticed for a long period of time. Second, a test may fail while the production code is correct (false alarms). While this type of test bug is easily noticed, it can still take a considerable amount of time and effort for developers to figure out that the bug resides in their test code rather than their production code. Figure 1.1 illustrates different scenarios of fixing these test bugs.

Although the reliability of test code is as important as that of production code, unlike production bugs [51], test bugs have not received much attention from the research community thus far. In chapter 2 we present an extensive study on test bugs that characterizes their prevalence, impact, and main cause categories. To the best of our knowledge, this work is the first to study general bugs in test code.

We mine the bug report repository and version control systems of the Apache Software Foundation (ASF), containing over 110 top-level and 448 sub open-source projects with different sizes and programming languages.
We manually inspect and categorize randomly sampled test bugs to find the common cause categories of test bugs.

Our results show that (1) around half of the Apache Software Foundation projects have had bugs in their test code; (2) the majority (97%) of test bugs result in false alarms, and their dominant root causes are "Semantic Bugs" (25%), "Flaky Tests" (21%), "Environmental Bugs" (18%), "Inappropriate Handling of Resources" (14%), and "Obsolete Tests" (14%); (3) a minority (3%) of the test bugs reported and fixed pertain to silent horror bugs, with "Assertion Related Bugs" (67%) being the dominant root cause; (4) developers contribute more actively to fixing test bugs, and test bugs require less time to be fixed.

The results of our study indicate that test bugs do exist in practice and their bug patterns, though similar to those of production bugs, differ noticeably, which makes current bug detection tools ineffective in detecting them. Although current bug detection tools such as FindBugs and PMD do have a few simple rules for detecting test bugs, we believe that this is not sufficient and there is a need for extending these rules or devising new bug detection tools specifically geared toward test bugs.

[Figure 1.1: Different scenarios for fixing test and production bugs.]

Over time, a test suite can accumulate redundant test cases [13, 33]. Redundancy in tests increases the maintenance overhead and the test execution time, without benefiting the test suite's coverage or effectiveness. Different test-suite reduction (also called minimization) techniques [62] have been proposed for removing redundant test cases. However, test minimization techniques have two shortcomings, namely (1) they use code coverage as a guideline to remove whole redundant test cases.
This can potentially remove a test case that has similar coverage as other test cases but different test statements and assertions. Assertions are known to directly affect test suite effectiveness [66], and (2) because they work at the whole test case level, they cannot target fine-grained redundancies within statements of a test case.

To be able to remove low-level redundancies, we would need an approach that can reorganize test cases at the statement level. However, reorganizing (or refactoring) test cases in general is not a straightforward task. Developers use the test suite to preserve the behaviour of the system when production code is refactored. However, there is no such safety net when a test suite needs to go through internal reorganization. Thus, any technique for this purpose should reorganize test cases in a way that preserves the behaviour of the test suite.

In chapter 3, we propose a fine-grained analysis approach for inferring a test model. Our test model can be used for performing test reorganization tasks while ensuring that the behaviour of the test suite remains the same. As opposed to current test reduction techniques that use coverage criteria, we model the actual behaviour of the test suite by capturing the production method calls with their inputs. Our analysis is performed at the test statement level as opposed to the whole test case level. We use our fine-grained analysis and test model to remove redundancies in test cases. Our empirical evaluation shows that our approach can reduce the number of redundant test cases by up to 85% and reduce the test execution time by up to 2.5%, while preserving the test suite behaviour.

1.1 Research Questions

To improve the quality of test code, we designed two high-level research questions:

RQ1. What are the characteristics of bugs in test code?

We conduct the first large-scale empirical study of bugs in test code to characterize the prevalence, impact, and root cause categories of bugs in test code.

RQ2.
How can we automatically reduce statement-level redundancies in the test code?

We propose a test suite model to support statement-level test reorganization. We use our model to reorganize test statements in test cases in a way that reduces the statement-level redundancies and execution time while preserving the test suite behaviour.

1.2 Contributions

Our work makes the following main contributions:

• We mine 5,556 fixed bug reports reporting test bugs by searching through the bug repository and version control systems of the Apache projects.

• We systematically categorize a total of 443 test bugs into multiple bug categories.

• We compare test bugs with production bugs in terms of the amount of attention received and time to fix.

• We assess whether existing bug detection tools such as FindBugs can detect test bugs.

• We propose a fine-grained test analysis method that works at the test statement level, a test model for identifying behaviour-preserving refactorings in a test suite, and a technique and algorithm that uses our test model to reduce redundancies in the test suite by reorganizing partly redundant test cases and removing the redundant parts.

• We implement our approach in a tool called TESTMODLER, which is publicly available [11].

• We empirically evaluate our approach by reorganizing the test suites of four real-world open-source applications in a way that reduces their redundancies and execution time.

The following paper has been published in response to RQ1, and a paper submission that addresses RQ2 is currently under review at an IEEE software testing conference.

– "An empirical study of bugs in test code" [52]. A. Vahabzadeh, A. Milani Fard, and A. Mesbah. In Proceedings of the International Conference on Software Maintenance and Evolution (ICSME).
IEEE Computer Society, 2015.

1.3 Thesis Organization

In chapter 2 of this thesis, we present the experimental methodology, results, and implications of the large-scale empirical study that we conducted to characterize the impact and root causes of test bugs. In chapter 3, we describe in depth the proposed test suite model and the automated technique that uses the test model to reorganize test cases in a way that reduces statement-level redundancies and execution time. Chapter 4 discusses the related work, and chapter 5 concludes the thesis and describes possible future research directions.

Chapter 2

An Empirical Study of Bugs in Test Code

Summary¹

Testing aims at detecting (regression) bugs in production code. However, testing code is just as likely to contain bugs as the code it tests. Buggy test cases can silently miss bugs in the production code or loudly ring false alarms when the production code is correct. We present the first empirical study of bugs in test code to characterize their prevalence and root cause categories. We mine the bug repositories and version control systems of 448 Apache Software Foundation (ASF) projects and find 5,556 test-related bug reports. We (1) compare properties of test bugs with production bugs, such as active time and fixing effort needed, and (2) qualitatively study 443 randomly sampled test bug reports in detail and categorize them based on their impact and root causes.
Our results show that (1) around half of all the projects had bugs in their test code; (2) the majority of test bugs are false alarms, i.e., the test fails while the production code is correct, while a minority of these bugs result in silent horrors, i.e., the test passes while the production code is incorrect; (3) incorrect and missing assertions are the dominant root cause of silent horror bugs; (4) semantic (25%), flaky (21%), and environment-related (18%) bugs are the dominant root cause categories of false alarms; (5) the majority of false alarm bugs happen in the exercise portion of the tests; and (6) developers contribute more actively to fixing test bugs, and test bugs are fixed sooner as compared to production bugs. In addition, we evaluate the ability of existing bug detection tools to detect bugs in test code.

¹ This chapter is an extension of the study that appeared at the 31st IEEE International Conference on Software Maintenance and Evolution (ICSME), 2015 [52].

2.1 Methodology

Our goal is to understand the prevalence and categories of bugs in test code. We conduct quantitative and qualitative analyses to address the following research questions:

RQ1: How prevalent are test bugs in practice?

RQ2: What are common categories of test bugs?

RQ3: Are test bugs treated differently by developers as compared to production bugs?

RQ4: Are current bug detection tools able to detect test bugs?

All of our empirical data is available for download [1].

2.1.1 Data Collection

Figure 2.1 depicts an overview of our data collection, which is conducted in two steps: mining of bug repositories for test-related bug reports (A), and analyzing commits in version control systems (B and C).

Mining Bug Repositories

One of the challenges in collecting test bug reports is distinguishing between bug reports for test code and production code. In fact, most search and filtering tools in current bug repository systems do not support this distinction. In order to identify
In order to identifybug reports reporting a test bug, we selected the JIRA bug repository of the ApacheSoftware Foundation (ASF) since its search/filter tool allows us to specify the typeand component of reported issues. We mine the ASF JIRA bug repository [9], whichcontains over 110 top-level and 448 sub open-source projects, with various sizesand programming languages.7Version Control System(1,236,162commits)Select Commits  Associated with a Bug ReportFilterresolution=“Fixed”type=“Bugs”Extract Modified LocationsBug Reports Associated with a Test Commit(B)Bug Reports Associated with a Production Commit(C)Bug Repository(447,021 bug reports)Filterresolution=“Fixed”type=“Bugs”component=“test”Bug Reports Reporting a Bug in Test Component (A)Retrieve HEAD Commit For Each ProjectCompile and Run FindBugs on Test Code and Production CodeParse XML Output of FindBugsIdentify Test Bugs Based on Warning’s LocationTest Bugs Detected By FindBugs(D)Check Modified LocationsFigure 2.1: Overview of the data collection phase.We search the bug repository by selecting the type as “Bug", component as“test", and resolution as “Fixed".Type. The ASF JIRA bug report types can be either “Bug", “Improvement", “NewFeature", “Test", or “Task". However, we observed that most of the reported test-related bugs have “Bug" as their type. The “Test" label is mainly used when someoneis contributing extra tests for increasing coverage and testing new features.Component. The ASF bug repository defines components for adding structure toissues of a project, classifying them into features, modules, and sub-projects [45].We observed that many projects in ASF JIRA use this field to distinguish differentmodules of the project. Specifically, they use “test" for the component field to referto issues related to test code.8Resolution. 
We only consider bug reports with resolution "Fixed", because if a reported bug is not fixed, it is difficult to verify that it is a real bug and analyze its root causes.

Analyzing Version Control Commits

Since our search query used on the bug repository is restrictive, we might miss some test bugs. Therefore, we augment our data by looking into commits of the version control systems of the ASF projects, similar to [42]. We use the read-only Git mirrors of the ASF codebases [7], which "contain full version histories (including branches and tags) from the respective source trees in the official Subversion repository at Apache"; thus using these mirrors does not threaten the validity of our study. We observed that most commits associated with a bug report mention the bug report ID in the commit message. Therefore, we leverage this information to distinguish between bug reports reporting test bugs and production bugs. We extract test bugs through the following steps:

Finding Commits with Bug IDs. We clone the Git repository of each Apache project and use JGit [8] to traverse the commits. In the ASF bug repository, every bug report is identified using an ID composed of {PROJECTKEY}-#BUGNUM, where PROJECTKEY is a project name abbreviation. Using this pattern, we search the commit messages to find whether a commit is associated with a bug report in JIRA. Once we have the ID, we can seamlessly retrieve the data regarding the bug report from JIRA.

Identifying Test Commits. For each commit associated with a bug report, we compute the diff between that commit and its parents. This enables us to identify the files that are changed by the commit, which in turn allows us to identify test commits, i.e., commits that only change files located in the test directory of a project. We refer to commits that change at least one file outside test directories² as production commits. If a project is using Apache Maven, we automatically extract information about its test directory from the pom.xml file.
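To illustrate these two steps, the commit-message matching on {PROJECTKEY}-#BUGNUM IDs and the classification of changed file paths could be sketched as follows. The class and method names are hypothetical; the actual pipeline traverses commits with JGit and uses per-project pom.xml information rather than this simplified path check:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch of the commit-classification heuristics described above.
public class CommitClassifier {

    // JIRA bug IDs follow the {PROJECTKEY}-#BUGNUM pattern, e.g. "HBASE-1234".
    private static final Pattern BUG_ID =
            Pattern.compile("\\b([A-Z][A-Z0-9]+)-(\\d+)\\b");

    /** Returns the first JIRA bug ID mentioned in a commit message, or null. */
    public static String extractBugId(String commitMessage) {
        Matcher m = BUG_ID.matcher(commitMessage);
        return m.find() ? m.group() : null;
    }

    /**
     * Fallback heuristic for non-Maven projects: a path is test code if some
     * directory on it has "test" in its name. (In the study, such matches
     * were additionally verified manually.)
     */
    public static boolean isTestPath(String path) {
        for (String dir : path.split("/")) {
            if (dir.toLowerCase().contains("test")) return true;
        }
        return false;
    }

    /** A commit is a test commit iff every changed file lies under a test directory. */
    public static boolean isTestCommit(Iterable<String> changedPaths) {
        for (String p : changedPaths) {
            if (!isTestPath(p)) return false;
        }
        return true;
    }
}
```

A commit touching both src/test/... and src/main/... would thus be classified as a production commit, matching the rule that a single changed file outside the test directories is enough.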
Otherwise, we consider any directory with "test" in its name as a test directory; we also manually verify that these are test directories.

² We ignored auxiliary files such as .gitignore and *.txt.

This phase resulted in two sets of bug reports, namely (1) those associated with a test commit (block B in Figure 2.1), and (2) those associated with a production commit (block C in Figure 2.1). Since a bug report can be associated with both test and production commits, in our analysis we only consider bug reports that are associated with test commits but not with any production commit (set B−C in the Venn diagram of Figure 2.2).

2.1.2 Test Bug Categorization

Manual Test Bug Categorization

To find common categories of test bugs (RQ2), we manually inspect the test bug reports. Manual inspection is a time-consuming task; on average, it took us around 12 minutes per bug report to study the comments, patches, and source code of any changed files. Therefore, we decided to sample the test-related bug reports mined in our data collection phase.

Sampling. We computed the union of the bug reports obtained from mining the bug reports (Section 2.1.1) and the version control systems (Section 2.1.1). This union is depicted as the grey set (A∪B)−C in the Venn diagram of Figure 2.2. We randomly sampled 499 (≈ 9.0%) of the unique bug reports from this set.

Categorization. For the categorization phase, we leverage information from each sampled bug report's description, discussions, proposed patches, fixing commit messages, and changed source code files.

First, we categorize each test bug in one of the two main impact classes: false alarms, i.e., the test fails while the production code is correct, or silent horrors, i.e., the test passes while the production code is or could be incorrect.
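As a hypothetical illustration (not drawn from the studied projects), the two impact classes can be contrasted with a toy production method and two faulty tests:

```java
// Contrived example of the two impact classes. The production method is
// correct; both "tests" below contain a test bug.
public class TestBugExamples {

    /** Production code under test: joins two strings with a comma. */
    public static String join(String a, String b) {
        return a + "," + b;
    }

    /**
     * Silent horror: the test exercises join() but never checks the result,
     * so it would keep passing even if join() were broken.
     */
    public static boolean silentHorrorTest() {
        String result = join("a", "b");
        // assertEquals("a,b", result);  <-- missing assertion: bugs go unnoticed
        return true; // the test "passes" regardless of result
    }

    /**
     * False alarm: the production code is correct, but the expected value in
     * the test is wrong, so the test fails although join() is fine.
     */
    public static boolean falseAlarmTest() {
        String result = join("a", "b");
        return result.equals("a, b"); // wrong expectation ("a, b" vs "a,b")
    }
}
```

The silent horror passes silently despite verifying nothing, while the false alarm fails loudly and sends the developer hunting for a production bug that does not exist.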
We adopt the terms false alarms and silent horrors coined by Cunningham [18].

Second, we infer common cause categories while inspecting each bug report. When three or more test bugs exhibited a common pattern, we added a new category. Subcategories also emerged to further subdivide the main categories.

[Figure 2.2: Bug reports collected from bug repositories and version control systems. |(A∪B)−C| = 5,556 test bug reports in total (|B−A−C| = 3,849, |A−C| = 1,707).]

Finally, we also categorize test bugs with respect to the location (in the test case) or unit testing phase in which they occur, as follows:

1. Setup. Setting up the test fixture, e.g., creating required files, entries in databases, or mock objects.

2. Exercise. Exercising the software under test, e.g., by instantiating appropriate object instances, calling their methods, or passing method arguments.

3. Verify. Verifying the output or changes made to the states, files, or databases of the software under test, typically through test assertions.

4. Teardown. Tearing down the test, e.g., closing files or database connections, or freeing memory allocated for objects.

The categorization step was a very time-consuming task and was carried out through several iterations to refine the categories and subcategories; the manual effort for these iterations was more than 400 hours, requiring more than 100 hours for each iteration.

Automatic Test Bug Categorization

We leveraged machine learning and natural language processing techniques to further categorize false alarm test bugs into their sub-categories. For each bug report
For each bug report11we used textual information of bug report such as title, description and its comments.We used other informations such as relation of this bug report to other bug reports(for example indication that this bug report is broken because of another bug report).We also used the bug report’s fixing commit and the changes that were made tothe source code. We performed the automatic categorization through the followingsteps :1. Preprocessing. Text level features: First, we performed a data cleaningphase, we deleted code snippets, stack traces and auto generated comments(auto generated comments about test results of commits and code qualitymeasurements) from bug reports. We also used a synonym list and replacedeach term with its synonym, for example all names of different operatingsystem such as Windows, Ubuntu, Cent OS, etc., are synonyms of word “Op-erating System" and replaced by this more general term. To leverage textualinformation of bug reports we used bag of word approach and computedTerm Frequency-Inverse Document Frequency (TF-IDF) after performingstemming.Source level features: We also used information stored in fixing commit ofbug reports, we used GumTree AST differencing tools [22] to compute thechanges made to the source code by each fixing commit. For each commit wecollected name of method calls that one of their arguments is changed as partof commit, percentage of changes made to body of control flow statements(if, while, for), name of methods that their body has changed (especiallyif tearDown and setUp methods are changed), class accesses, instantiatedclasses, string arguments added and method annotation changes. Since, mostof projects in our set of test bug reports use Java programming language weonly consider those bug reports that their programming language is Java.2. Training. For training we used Support Vector Machine with SequentialMinimal Optimization (SMO) machine learning algorithm. 
To train theSupport Vector Machine we used the dataset of labelled bug reports collectedby manual test bug categorization phase. We randomly selected 70% of bugreports for training and validation, and used the other 30% of bug reports fortesting. In this way, since the testing set is not used for training, performance12of our classifier on this testing set can be representative of its performance ondata set of whole Jira bug repository’s test bug reports.3. Classifier Evaluation. To evaluate our classifier we measured its perfor-mance on the test set, we measured precision, recall and F1 for each categoryclassified by the classifier.4. Automatic Classification. We used our classifier to automatically classifyall test bug reports we gathered by our data collection phase.2.1.3 Test Bug Treatment AnalysisTo answer RQ3, we measure the following metrics for each bug report:Priority: In JIRA, the priority of a bug report indicates its importance in relationto other bug reports. For the Apache projects we analyzed, this field hadone of the following values: Blocker, Critical, Major, Minor or Trivial. Forstatistical comparisons, we assign a ranking number from 5 to 1 to each,respectively.Resolution time: The amount of time taken to resolve a bug report starting fromits creation time.Number of unique authors: Number of developers involved in resolving the issue(based on their user IDs).Number of comments: Number of comments posted for the bug report. It capturesthe amount of discussions between developers.Number of watchers: Number of people who receive notifications; an indicationof the number of people interested in the fate of the bug report.The metrics priority, number of unique authors, comments, and watchers canbe used as an indication as to what extent developers contribute for fixing a bug.Resolution time metric indicates how much time is needed for fixing a particularbug. 
We collected these metrics for all the test bug reports and all the production bug reports separately. For the comparison analysis, we only included projects that had at least one test bug report. To obtain comparable pools of data points, the number of production bug reports that we sampled was the same as the number of test bug reports mined from each project.

2.1.4 FindBugs Study
To answer RQ4, we use FindBugs [35], a static byte-code analyzer popular in practice for detecting common patterns of bugs in Java code. We investigate its effectiveness in detecting bugs in test code.

Detecting Bugs in Tests
We run FindBugs (v3.0.0)³ on the test code as well as the production code of the latest version of the Java ASF projects that use Apache Maven (see Figure 2.1 (D)). Compiling projects that do not use Maven requires much manual effort, for instance in resolving dependencies on third-party libraries. We also noticed that FindBugs crashes while running on some of the projects. In total, we were able to successfully run FindBugs on 129 of the 448 ASF sub-projects.

Analysis of Bug Patterns Found by FindBugs
We parse the XML output of FindBugs and extract patterns from the reported bugs. FindBugs statically analyzes the byte code of Java programs to detect simple bug patterns, applying static analysis techniques such as control and data flow analyses. Among the bug patterns that FindBugs detects, we only considered the reported Correctness and Multithreaded Correctness warnings, as others, such as internationalization, bad practice, security, or performance, are more related to non-functional bugs.

Effectiveness in Detecting Test Bugs
To evaluate FindBugs' effectiveness in detecting test bugs, we choose an approach similar to that used by Couto et al. [17]. We sample 50 bug reports from projects for which we can compile the version containing the bug, just before the fix.
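The XML analysis step above amounts to filtering FindBugs' report down to the Correctness categories. A sketch using standard XML parsing is shown below; the report snippet is a simplified stand-in, although the BugInstance element, its category attribute, and the category codes (CORRECTNESS, MT_CORRECTNESS, I18N) follow FindBugs' actual output format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FindBugsReportFilter {
    // Count BugInstance warnings whose category is in the given set.
    static int countWarnings(String xml, Set<String> categories) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList bugs = doc.getElementsByTagName("BugInstance");
        int count = 0;
        for (int i = 0; i < bugs.getLength(); i++) {
            Element bug = (Element) bugs.item(i);
            if (categories.contains(bug.getAttribute("category"))) count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Simplified stand-in for a FindBugs XML report.
        String xml = "<BugCollection>"
                + "<BugInstance type='NP_NULL_ON_SOME_PATH' category='CORRECTNESS'/>"
                + "<BugInstance type='IS2_INCONSISTENT_SYNC' category='MT_CORRECTNESS'/>"
                + "<BugInstance type='DM_DEFAULT_ENCODING' category='I18N'/>"
                + "</BugCollection>";
        // Only Correctness and Multithreaded Correctness are kept, as in the study.
        System.out.println(countWarnings(xml, Set.of("CORRECTNESS", "MT_CORRECTNESS")));
    }
}
```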
By comparing the versions before and after a fix, we are able to identify the set of methods that were changed as part of the fix. We run FindBugs on the versions before and after the fix to see whether FindBugs is able to detect the test bug and could have potentially prevented it. If FindBugs reports any warning in any of the methods changed by the fix and these warnings disappear after the fix, we assume that FindBugs is able to detect the associated test bug. Note that it is possible that the disappeared warnings are false positives or unrelated to the actual fix. However, we make this assumption to make the results comparable to those obtained by [17] for the efficacy of static analysis tools on production bugs. We also manually examine these warnings to make sure they are related to the fix and the associated test bug.

³http://findbugs.sourceforge.net

Table 2.1: Top 10 ASF projects sorted by the number of reported test bugs.

Project                        Production Code  Test Code  # Test Bug
                               KLOC             KLOC       Reports
Derby                          386              370        614
HBase                          587              195        440
Hive                           836              124        295
Hadoop HDFS                    101              57         286
Hadoop Common                  1249             380        279
Hadoop Map/Reduce              60               24         231
Accumulo                       405              78         187
Qpid                           553              93         152
Jackrabbit Content Repository  247              107        145
CloudStack                     1361             228        111

The next four sections present the results of our study for each research question in turn.

2.2 Prevalence of Test Bugs
Overall, our analysis reveals that 47% of the ASF sub-projects (211 out of 448) have had bugs in their tests. Our search query on the JIRA bug repository retrieved 2,040 bug reports. After filtering non-test related reports, we obtained 1,707 test bug reports, shown as A−C in the Venn diagram of Figure 2.2. The search in version control systems resulted in 4,982 bug reports associated only with test commits, depicted as the set B−C in Figure 2.2. In total, we found 5,556 unique test bug reports ((A∪B)−C).
Table 2.2 presents descriptive statistics for the number of test bug reports, and Table 2.1 shows the top 10 ASF projects in terms of the number of test bug reports we found in their bug repositories⁴. For additional fine-grained data per project, we refer the reader to our online report [1].

⁴Source lines of code are counted over all programming languages used in a project, measured with CLOC: http://cloc.sourceforge.net

Table 2.2: Descriptive statistics of test bug reports.

Min  Mean  Median  σ     Max  Total
0    12.4  0       48.3  614  5,556

Finding 1: Around half of all the projects analyzed had bugs in their test code that were reported and fixed. On average, there were 12.4 fixed test bugs per project.

2.3 Categories of Test Bugs
We manually examined the 499 sampled bug reports; 56 of these turned out to be difficult to categorize due to a lack of sufficient information in the bug report. We categorized the remaining 443 bug reports. Table 2.3 shows the main categories and their subcategories that emerged from our manual analysis. Our results show that a large number of reported test bugs result in a test failure (97%), while a small fraction pertains to silent test bugs that pass (3%).

2.3.1 Silent Horror Test Bugs
Silent test bugs that pass are much more difficult to detect and report than buggy tests that fail. Hence, it is not surprising that only about 3% of the sampled bug reports (15 out of 443) belong to this category.
Figure 2.3 depicts the distribution of silent horror bug categories in terms of the location of the bug. In five instances, the fault was located in the exercise step of the test case, i.e., the fault caused the test not to execute the SUT for the intended testing scenario, which made the test useless. For instance, as reported in bug report JCR-3472, due to a fault in the test code of the Apache Jackrabbit project, queries in LargeResultSetTest run against a session where the test content is not visible; the resulting set is thus empty and the whole test is pointless.
In another example, due to a test dependency between two test cases, one of the test cases "is actually testing the GZip compression rather than the DefaultCodec due to the setting hanging around from a previous test" (FLUME-571). Such issues could explain why these bugs remain unnoticed and are difficult to detect.

Figure 2.3: Distribution of silent horror bug categories (Exercise 33%, Verify 67%; within Verify: Missing Assertion 60%, Assertion Fault 40%).

The other 10 instances were located in the verification step, i.e., they all involved test assertions. Of these, six pertained to a missing assertion and four were related to faults in assertions that checked a wrong condition or variable.
Interestingly, two of the silent test bugs resulted in a failure when they were fixed, indicating a bug in the production code that was silently ignored. For example, in ACCUMULO-1878, 1927, 1988 and 1892, since the test did not check the return value of the executed M/R jobs, these jobs were failing silently (ACCUMULO-1927); when this was fixed, the test failed. Figure 2.4 shows the fixing commit for HBASE-7901, a bug in the for loop condition that caused the test not to execute the assertion.
Although JUnit 4 permits asserting that a particular exception is thrown through the expected annotation and the ExpectedException rule, many testers are used to or prefer [26] using the traditional combination of try/catch and the fail() assertion to achieve this goal. However, this pattern tends to be error-prone. In our sampled list, four out of 15 silent bugs involved incorrect usage of try/catch in combination with the fail() primitive. For example, Figure 2.5 shows the fixing commit for the bug report JCR-500; the test needs to assert that unregistering a namespace that is not registered throws an exception. However, a fail() assertion is missing from the code, making the whole test case ineffective.
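The essence of the try/catch-with-fail() pattern can be shown in a self-contained sketch; the unregisterNamespace stand-in below is invented for illustration, and a boolean return plays the role of JUnit's fail() marker (in JUnit 4 one would instead use @Test(expected=...) or the ExpectedException rule mentioned above).

```java
public class ExpectedExceptionSketch {
    // Hypothetical SUT call: throws when the prefix is not registered.
    static void unregisterNamespace(String prefix) {
        throw new IllegalStateException("prefix not registered: " + prefix);
    }

    // Correct pattern: the statement right after the SUT call (JUnit's fail())
    // makes the test fail when the expected exception is NOT thrown. Omitting
    // it, as in JCR-500, lets the test silently pass in that case.
    static boolean sutThrowsAsExpected() {
        try {
            unregisterNamespace("NotCurrentlyRegistered");
            return false; // plays the role of fail("expected an exception")
        } catch (IllegalStateException expected) {
            return true;  // expected behaviour
        }
    }

    public static void main(String[] args) {
        System.out.println(sutThrowsAsExpected() ? "PASS" : "FAIL");
    }
}
```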
Another pattern of this type of bug occurs when the SUT in the try block can throw multiple exceptions and the tester does not assert on the type of the thrown exception.

    -for (int j = 0; i < cr.getFiles().size(); j++) {
    +for (int j = 0; j < cr.getFiles().size(); j++) {
       assertTrue(cr.getFiles().get(j)
         .getReader().getMaxTimestamp() < (System.currentTimeMillis() - ...

Figure 2.4: An example of a silent horror test bug due to a fault in a for loop.

    try {
      nsp.unregisterNamespace("NotCurrentlyRegistered");
    + fail("Trying to unregister an unused prefix must fail");
    } catch (NamespaceException e) {
      // expected behaviour
    }

Figure 2.5: An example of a silent horror test bug due to a missing assertion.

It is worth mentioning that two of these 15 bugs could have potentially been detected statically: in one case (ACCUMULO-828), the whole test case did not have any assertions, and in another (SLIDER-41) a number of test cases were not executed because they did not comply with the test class name conventions of Maven, i.e., their names did not start with "Test".

Finding 2: Silent horror test bugs form a small portion (3%) of reported test bugs. Assertion-related faults are the dominant root cause of silent horror bugs.

Table 2.3: Test bug categories for false alarms.

Semantic Bugs
  S1. Assertion Fault: Fault in the assertion expression or arguments of a test case.
  S2. Wrong Control Flow: Fault in a conditional statement of a test case.
  S3. Incorrect Variable: Usage of the wrong variable.
  S4. Deviation from Test Requirement and Missing Cases: A missing step in the exercise phase, a missing possible scenario, or a test case that deviates from the actual requirements.
  S5. Exception Handling: Wrong exception handling.
  S6. Configuration: The configuration file used for testing is incorrect, or the test does not consider these configurations.
  S7. Test Statement Fault or Missing Statements: A statement in a test case is faulty or missing.
Environment
  E1.
Differences in Operating System: Tests in this category pass on one OS but fail on another.
  E2. Differences in third-party libraries or JDK versions and vendors: The failure is due to incompatibilities between different versions of the JDK, different implementations of the JDK by different vendors, or different versions of third-party libraries.
Resource Handling
  I1. Test Dependency: Running one test affects the outcome of other tests.
  I2. Resource Leak: A test does not release its acquired resources properly.
Flaky Tests
  F1. Asynchronous Wait: The test failure is due to an asynchronous call and not waiting properly for the result of the call.
  F2. Race Condition: The test failure is due to non-deterministic interactions of different threads.
  F3. Concurrency Bugs: Concurrency issues such as deadlocks and atomicity violations.
Obsolete Tests
  O1. Obsolete Statements: Statements in a test case are not evolved as the production code evolves.
  O2. Obsolete Assertions: Assertion statements are not evolved as the production code evolves.

2.3.2 False Alarm Test Bugs
We categorized the 428 bug reports that were false alarms based on their root cause. We identified five major causes for false alarms. Figure 2.6 shows the distribution of each main category and of the testing phase in which the false alarm bug occurred.

Finding 3: Semantic bugs (25%) and flaky tests (21%) are the dominant root causes of false alarms, followed by environment (18%) and resource handling (14%) related causes. The majority of false alarm bugs occur in the exercise phase of testing.

Semantic Bugs
This category comprises 25% of the sampled test bugs. Semantic bugs reflect inconsistencies between specifications and production code on the one hand, and test code on the other. Based on our observations of common patterns of these bugs, we categorized them into seven subcategories, as shown in Table 2.3.
Figure 2.7a presents the percentage of each subcategory, and Figure 2.8a shows the fault location distribution over the testing phases.

Figure 2.6: Distribution of false alarms. (a) Distribution based on bug categories: Semantic Bugs 25%, Flaky Test 21%, Environment 18%, Resources 14%, Obsolete Tests 14%, Other 8%. (b) Distribution based on testing phases: setup 28%, exercise 34%, verify 24%, teardown 14%.

The majority of test bugs in the semantic bug category (33%) belong to tests that miss a case or deviate from the test requirements (S4). Examples include tests that miss setting some required properties of the SUT (e.g., CLOUDSTACK-2542 and MYFACES-1625), or tests that miss a required step to exercise the SUT correctly (e.g., HDFS-824). Test statement faults or missing statements account for 19% of the bugs in this category. For example, in CLOUDSTACK-3796, a statement fault resulted in failing to set the attributes needed for setting up the test correctly, thus resulting in a failure. The use of an incorrect variable, which may result in asserting the wrong variable (e.g., DERBY-6716) or in wrong test behaviour, was observed in 9% of the semantic bugs. 7% of the semantic bugs in our sample were due to improper exception handling in test code, which resulted in false test failures (e.g., JCR-505). Some tests require reading properties from an external configuration file in order to run with different parameters without changing the test code itself; however, some tests did not use these configurations properly, and in other cases the configurations were buggy themselves. 7% of the false alarm bugs had this issue. We categorized a bug in the wrong control flow category if the test failed due to a fault in a conditional statement (e.g., an if, for, or while condition); 5% of the semantic bugs belong to this category.
Another 5% of the semantic bugs were due to faulty assertions (e.g., JCR-503).

Finding 4: Deviations from test requirements or missing cases in exercising the SUT (33%) and faulty or missing test statements (19%) are the most prevalent semantic bugs in test code.

Environment
Around 18% of the bug reports pertained to a failing test due to environmental issues, such as differences in path separators between Windows and Unix systems. In this case, tests pass under the environment they are written in, but fail when executed in a different environment. Since open source software developers typically work in diverse development environments, this category accounts for a large portion of the test bug reports filed.

Figure 2.7: Percentage of subcategories of test bugs. (a) Semantic bugs: Deviation from Requirement and Missing Cases 33%, Missing Normal Statement and Fault 19%, S3 9%, S5 7%, S6 7%, S1 5%, S2 5%, Other 15%. (b) Environmental bugs: OS 61%, JDK & Third Party Library 26%, Other 13%. (c) Resource related: Test Dependency 61%, Resource Leak 31%, Other 8%. (d) Flaky tests: Async Wait 46%, Concurrency 37%, Race Condition 12%, Other 5%. (e) Obsolete tests: Obsolete Normal Statement 77%, Obsolete Assertion 23%.

Figure 2.7b and Figure 2.8b show the distribution of environmental bugs and their fault locations. About 61% of the bug reports in this category were due to operating system differences (E1), particularly differences between the Windows and Unix operating systems. Testers make platform-specific assumptions that may not hold true on other platforms, e.g., assumptions about file path and classpath conventions, the order of files in a directory listing, and environment variables (MAPREDUCE-4983).
Some of the common causes we observed that result in failing tests in this category include: (1) Differences in path conventions, e.g., Windows paths are not necessarily valid URIs while Unix paths are, and Windows uses quotation to deal with spaces in file names while in Unix spaces must be escaped (HADOOP-8409). (2) File system differences, e.g., in Unix one can rename, delete, or move an opened file while its file descriptor remains pointing to the proper data; in Windows, however, opened files are locked by default and cannot be deleted or renamed (FLUME-349). (3) File permission differences, e.g., the default file permissions differ across platforms. (4) Platform-specific use of environment variables, e.g., Windows uses the %ENVVAR% notation and Unix the $ENVVAR notation to retrieve environment variable values (MAPREDUCE-4869). Also, classpath entries are separated by ';' in Windows and by ':' in Unix.
Differences in JDK versions and vendors (E2) were responsible for 26% of the environment related test bugs. For example, with the IBM JDK, developers should use SSLContext.getInstance("SSL_TLS") instead of "SSL" as in the Oracle JDK, to ensure the same behaviour (FLUME-2441).
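The separator differences above are exposed directly by the Java standard library, and using the API instead of hard-coded characters avoids this whole class of environmental test bugs. A small sketch (the printed values depend on the platform the snippet runs on):

```java
import java.io.File;
import java.nio.file.Paths;

public class SeparatorSketch {
    public static void main(String[] args) {
        // Classpath entry separator: ';' on Windows, ':' on Unix.
        System.out.println("classpath separator: " + File.pathSeparator);
        // File name separator: '\' on Windows, '/' on Unix.
        System.out.println("file separator: " + File.separator);
        // Building paths through the API, rather than concatenating
        // hard-coded '/' or '\\', keeps tests portable across platforms.
        System.out.println(Paths.get("target", "test-data", "out.txt"));
    }
}
```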
There are also compatibility issues between different versions of JDKs; e.g., testers depended on the iteration order of a HashMap, which changed in IBM JDK 7 (FLUME-1793).

Finding 5: 61% of environmental false alarms are platform-specific failures, caused by operating system differences.

Figure 2.8: Test bug distribution based on the testing phase in which bugs occurred. (a) Semantic bugs: Setup 26%, Exercise 35%, Verify 35%, Teardown 4%. (b) Environmental bugs: Setup 35%, Exercise 25%, Verify 31%, Teardown 9%. (c) Resource related: Setup 27%, Exercise 8%, Verify 7%, Teardown 58%. (d) Flaky tests: Setup 25%, Exercise 51%, Verify 22%, Teardown 2%. (e) Obsolete tests: Setup 20%, Exercise 47%, Verify 27%, Teardown 6%.

Inappropriate Handling of Resources
Ideally, test cases should be independent of each other; in practice, however, this is not always true, as reported in a recent empirical study [65]. Around 14% of the bug reports (61 out of 428) point to inappropriate handling of resources, which may not cause failures on its own, but causes other dependent tests to fail when those resources are used. Figure 2.7c shows the percentages for the subcategories of resource handling bugs, and Figure 2.8c shows the distribution of the testing phases in which the fault occurs. About 61% of these bugs were due to test dependencies.
A good practice in unit testing is to mitigate any side effects a test execution might have; this includes releasing locally used resources and rolling back possible changes to external resources such as databases. Most unit testing frameworks provide hooks to clean up after a test run, such as the tearDown method in JUnit 3 or methods annotated with @After in JUnit 4. However, testers might forget or fail to perform this cleanup step properly. One common mistake is a test that changes some persistent data (or acquires some resources) and conducts the cleanup in the test method's body.
In this case, if the test fails due to an assertion failure, an exception, or a timeout, the cleanup operation will not take place, causing other tests, or even future runs of this test case, to fail. Figure 2.9 illustrates this bug pattern and its fix. Another common problem we observed is that testers forget to call the super.tearDown() or super.setUp() methods, which prevents the superclass from freeing acquired resources (DERBY-5726). Bug detection tools such as FindBugs can detect these types of test bugs.

    @Test
    public void test() {
      acquireResources();
      assertEquals(a, b);
      releaseResources();
    }

    (a) Buggy test.

    @Before
    public void setUp() {
      acquireResources();
    }
    @Test
    public void test() {
      assertEquals(a, b);
    }
    @After
    public void tearDown() {
      releaseResources();
    }

    (b) Fixed test.

Figure 2.9: Resource handling bug pattern in test code.

Finding 6: 61% of inappropriate resource handling bugs are caused by dependent tests. More than half of all resource handling bugs occur in the teardown phase of test cases.

Flaky Tests
These test bugs are caused by non-deterministic behaviour of test cases, which intermittently pass or fail. Such tests, known as 'flaky tests' among practitioners, are time consuming for developers to resolve, because they are hard to reproduce [21]. A recent empirical study on flaky tests [42] revealed that the main root causes of flaky tests are Async Wait, which happens when a test does not wait properly for an asynchronous call, and Race Condition, which is due to interactions of different threads, such as order violations.
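The Async Wait issue can be illustrated with a sketch: guessing a fixed Thread.sleep() duration is timing-dependent, whereas blocking on a completion signal (here a CountDownLatch with a generous timeout) is deterministic. This is an illustrative pattern, not code from the studied projects.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AsyncWaitSketch {
    // Simulated asynchronous operation that signals completion via a latch.
    static void startAsyncJob(CountDownLatch done) {
        new Thread(() -> {
            try {
                Thread.sleep(50); // simulated work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.countDown(); // signal completion
        }).start();
    }

    // Robust wait: block until the job signals completion, up to a generous
    // timeout, instead of sleeping for a fixed, guessed amount of time.
    static boolean runAndWait() throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        startAsyncJob(done);
        return done.await(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAndWait() ? "completed" : "timed out");
    }
}
```

A test built on such a completion signal only fails when the operation genuinely does not finish, rather than when the machine happens to be slow.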
Our results are in line with their findings: we found that not waiting properly for asynchronous calls (46%) is the main root cause of flaky tests, followed by race conditions between different threads (Figure 2.7d). As shown in Figure 2.8d, most flaky test bugs (51%) are due to bugs in the exercise phase of tests.

Finding 7: The majority of flaky test bugs occur when the test does not wait properly for asynchronous calls during the exercise phase of testing.

Obsolete Tests
Ideally, test and production code should evolve together; in practice, however, this is not always the case [63]. An obsolete test [32] is a test case that is no longer valid due to the evolution of the specifications and production code of the program under test. Obsolete tests check features that have been modified, substituted, or removed. When an obsolete test fails, developers spend time examining recent changes made to the production code as well as the test code itself to figure out that the failure is not a bug in the production code.

Table 2.4: Test code warnings detected by FindBugs.

Bug Description                                                Category  Percentage
Inconsistent synchronization                                   Flaky     29.8%
Possible null pointer dereference in method on exception path  Semantic  17.6%
Using pointer equality to compare different types              Semantic   8.8%
Possible null pointer dereference                              Semantic   7.3%
Class defines field that masks a superclass field              Semantic   3.9%
Nullcheck of value previously dereferenced                     Semantic   2.9%
An increment to a volatile field isn't atomic                  Flaky      2.9%
Method call passes null for nonnull parameter                  Semantic   2.4%
Incorrect lazy initialization and update of static field       Flaky      2.4%
Null value is guaranteed to be dereferenced                    Semantic   2.0%

As shown in Figure 2.8e, developers mostly need to update the exercise phase of obsolete tests. This is expected, as adding new features to production code may change the steps required to execute the SUT but may not change the expected correct behaviour of the SUT, i.e., the assertions.
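For illustration, consider a hypothetical API evolution in which a constructor gains a required parameter: the exercise phase of the test must be updated, while the assertion stays untouched. The Buffer class and its methods below are invented for this example.

```java
public class ObsoleteTestSketch {
    // Hypothetical production class after evolution: the constructor
    // now requires an explicit capacity (previously it took none).
    static class Buffer {
        private final int capacity;
        Buffer(int capacity) { this.capacity = capacity; }
        int remaining() { return capacity; }
    }

    public static void main(String[] args) {
        // Obsolete exercise phase (no longer compiles after the change):
        //   Buffer b = new Buffer();
        // Updated exercise phase:
        Buffer b = new Buffer(16);
        // The assertion (verification phase) is unchanged by the evolution.
        System.out.println(b.remaining() == 16 ? "PASS" : "FAIL");
    }
}
```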
In fact, as depicted in Figure 2.7e, only 23% of obsolete tests required a change to assertions.

Finding 8: The majority of obsolete tests require modifications in the exercise phase of test cases, and mainly in normal statements (77%) rather than assertions.

2.4 Automatic Test Bug Classification
We used the training set, a randomly chosen 70 percent of our sampled set, for choosing the best set of features and for comparing different algorithms. For evaluation, we used the remaining 30 percent of our sampled set as the test set. Table 2.5 shows the performance of our classifier in terms of precision, recall, and F1. We ran the classifier on the full set of false alarm test bug reports and automatically categorized them into the semantic, flaky, environmental, resource related, and obsolete categories. Table 2.5 also shows the percentage of each category obtained by the automatic classification. As shown in Table 2.5, the percentages obtained by automatic categorization are in line with the manual categorization. This suggests that our sampled data set was representative of the whole data set.

Table 2.5: Performance of the classifier for false alarm categories.

Category       Percentage in JIRA  Precision  Recall  F-Measure
Semantic       29.21%              0.75       0.75    0.75
Flaky          18.19%              0.78       0.70    0.74
Environmental  20.47%              0.62       0.77    0.68
Resources      10.20%              0.71       0.39    0.50
Obsolete       21.93%              0.34       0.50    0.40

2.5 Test Bugs vs Production Bugs
Table 2.6 shows the median, mean, standard deviation, and maximum of each metric defined in Section 2.1.3, for test bugs (TE) and production bugs (PR). We used the nonparametric Mann-Whitney U test to compare the distribution of test bugs with that of production bugs and to compute the p-values. Because the p-values are close to zero due to the large sample size, we also computed effect sizes — standardized mean difference (d) and odds ratio (OR) — to assess the meaningfulness of the differences. The results indicate that test bugs take less time to be fixed than production bugs.
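The standardized mean difference in Table 2.6 is Cohen's d, the difference of group means divided by a pooled standard deviation. A minimal sketch on made-up resolution-time samples (not the study's data):

```java
public class EffectSizeSketch {
    static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    // Sample variance (n - 1 denominator).
    static double variance(double[] xs) {
        double m = mean(xs), s = 0;
        for (double x : xs) s += (x - m) * (x - m);
        return s / (xs.length - 1);
    }

    // Cohen's d: (mean(a) - mean(b)) / pooled standard deviation.
    static double cohensD(double[] a, double[] b) {
        double pooled = Math.sqrt(((a.length - 1) * variance(a)
                + (b.length - 1) * variance(b)) / (a.length + b.length - 2));
        return (mean(a) - mean(b)) / pooled;
    }

    public static void main(String[] args) {
        // Made-up resolution times (days) for test (TE) and production (PR) bugs.
        double[] te = {1, 2, 2, 3, 5};
        double[] pr = {4, 6, 7, 9, 14};
        // Negative d: the TE group is resolved faster on average.
        System.out.println(cohensD(te, pr));
    }
}
```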
Although the priorities assigned to test bugs and production bugs have similar distributions, developers seem to contribute more to fixing test bugs, as both the median and the mean of the number of unique authors, watchers, and comments are higher for test bugs than for production bugs.

Table 2.6: Comparison of test and production bug reports.

Metric                  Type  Med   Mean    SD      Max      d      OR    p-value
Priority                PR    3.00  2.91    0.76    5.00     -0.13  0.78  4.9e-14
                        TE    3.00  2.80    0.75    5.00
Resolution Time (days)  PR    6.39  109.70  282.04  2843.56  -0.20  0.69  <2.2e-16
                        TE    2.77  58.97   213.72  2666.60
#Comments               PR    3.00  4.91    6.74    101.00   0.15   1.31
                        TE    4.00  5.88    6.26    99.00
#Authors                PR    2.00  2.41    1.53    18.00    0.31   1.77
                        TE    2.00  2.89    1.53    12.00
#Watchers               PR    0.00  1.32    2.04    24.00    0.25   1.58
                        TE    1.00  1.84    2.06    16.00

Finding 9: On average, developers contribute more actively to fixing test bugs compared to production bugs, and test bugs are resolved faster than production bugs.

2.6 FindBugs Study

2.6.1 Detected Bugs
FindBugs reported 205 Correctness and Multithreaded Correctness warnings in the test code of 20 out of the 129 ASF projects on which we were able to compile and run the tool. Table 2.7 summarizes descriptive statistics for the number of reported warnings. For additional fine-grained data per project, we refer the reader to [1].

Table 2.7: Descriptive statistics of bugs reported by FindBugs.

Min  Mean  Median  σ    Max  Total
0    1.6   0       5.5  48   205

2.6.2 Categories of Test Bugs Detected by FindBugs
Table 2.4 shows the top 10 most frequent potential test bugs detected by FindBugs and their percentages. Figure 2.10 shows the distribution of the different categories of these warnings. We consider the Multithreaded Correctness warnings reported by FindBugs as flaky test bugs, since these concurrency-related warnings can potentially cause a test to pass and fail intermittently. We also consider the Correctness warnings as semantic bugs, i.e., inconsistencies between the actual and intended behaviour of the test.
FindBugs also has one rule to detect bugs that cause different behaviours on Linux and Windows due to path separator differences, i.e., environment related bugs. FindBugs has six test-related rules that are part of the included correctness rules; however, FindBugs did not report any warnings in these categories. As depicted in Figure 2.10, semantic-related and flaky warnings are the major warnings reported by FindBugs.

Figure 2.10: Distribution of warnings reported by FindBugs (Semantic 56%, Flaky 43%, Environment 1%).

2.6.3 FindBugs' Effectiveness in Detecting Test Bugs
Earlier studies [17, 46] report that static code analysis tools are able to detect 5-15% of general bugs in software projects. We wanted to know how they perform on test bugs. We sampled 50 bug reports out of the 623 bug reports that changed Java source files and whose projects we were able to compile when checking out the version just before the fix. Among these 50 sampled bug reports, in 3 (6%) instances at least one FindBugs warning disappeared after the fix. We analyzed these 3 instances manually and found that one of them was a false positive and the other two warnings were not directly related to the bug report. This means that FindBugs was not able to detect any of the 50 test bugs.

Finding 10: FindBugs could not detect any of the test bugs in our sampled 50 bug reports.

2.7 Discussion
Our study has implications for both testers and developers of bug detection tools. The results of this study imply that test code is susceptible to bugs just like production code (Finding 1). Test code is supposed to guard production code against potential (future) bugs and thus should be bug free itself. However, current bug detection tools are mainly geared towards production code. For instance, FindBugs has only six bug detection patterns dedicated to test code among its 424 bug patterns [6].
Similarly, PMD, another popular static bug detection tool for Java, has only 12 bug pattern rules for bugs in JUnit [10]. Moreover, the patterns of environmental bugs, flaky tests, and resource handling bugs in test code differ from those in production code, making current bug detection tools unable to detect them in test code. For example, the latest version of FindBugs detects run-time error handling mistakes based on the method proposed by [57]. We identified a similar pattern of this bug in test code (Figure 2.9); however, because of a slight change of the pattern in the test code, FindBugs was not able to detect it.
In our study, FindBugs generated an average of 1.6 warnings in the test code of 129 open source projects, most of which fall into the semantic and flaky test categories, the most prevalent categories of bug reports (Findings 2 and 4). However, many of the reported bugs cannot simply be detected using current static bug detection tools. This is particularly true for the silent horror bugs, which are mostly due to assertion related faults (Finding 3). Finding automated ways of detecting silent horror test bugs could be of great value to developers, since such bugs are extremely difficult to detect.
Our results show that a large portion of the bugs in test code are semantic bugs, i.e., the test code does not properly test the production code (Finding 4). Any method that can enhance developers' understanding of the requirements, the software under test, and its valid usage scenarios can help to reduce the number of semantic bugs in test code.
Compared to production bugs, we find that test bugs receive more attention from developers and are fixed sooner. This might be because the majority of test bugs result in a test failure, which is difficult for developers to ignore. Another explanation could be that bugs in test code are easier to fix than bugs in production code.

Threats to Validity.
An internal validity threat is that the categorization of bug reports was performed by two of the co-authors, which may introduce author bias. To mitigate this, we conducted a review process in which each person reviewed the categorization done by the other. Regarding the test bugs detected by FindBugs, we did not manually inspect each one to see whether it is indeed a real bug. However, since we chose only the Correctness categories of FindBugs, we believe the reported bugs are issues in the test code that need to be fixed.
In terms of external threats, our results are based on bug reports from a number of experimental objects, which calls their representativeness into question. However, we believe that the chosen 448 ASF projects are representative of real-world applications, as they vary in domains, such as desktop applications, databases, and distributed systems, and in programming languages, such as Java, C++, and Python. In addition, we focus exclusively on bug reports that were fixed. This decision was made since the root cause would be difficult to determine from open reports, which have no corresponding fix. Further, a fix indicates that there was indeed a bug in the test code.

2.8 Conclusions
This work presents the first large-scale study of test bugs. Test bugs may cause a test to fail while the production code is correct (false alarms), or may cause a test to pass while the production code is incorrect (silent horrors). Both are costly for developers. Our results show that test bugs are in fact prevalent in practice, that the majority are false alarms, and that semantic bugs and flaky tests are the dominant root causes of false alarms, followed by environment and resource handling related causes. Our evaluation reveals that FindBugs, a popular bug detection tool, is not effective in detecting test bugs.

Chapter 3
Reducing Fine-Grained Test Redundancies

Summary
Developers write tests to ensure the correct behaviour of production code. Test cases need to be modified as the software evolves.
Over time, tests can accumulate redundancies, which in turn increase the test execution time and the overhead of maintaining the test suite. Test reduction techniques identify and remove redundant test cases of a given test suite. However, these techniques remove whole test cases and do not address the issue of partly redundant test cases. In this chapter, we propose an approach for performing fine-grained analysis of test cases to support test case reorganization while preserving the behaviour of the test suite. Our analysis is based on the inference of a test suite model that enables test reduction at the test statement level. We evaluate our technique on the test suites of four real-world open source projects. Our results show that our technique can reduce the number of partly redundant test cases by up to 85% and the test execution time by up to 2.5%, while preserving the test suite's behaviour.

3.1 Approach
To enable a fine-grained analysis of test cases, we first present a model that captures the relationship between test states and test statements. We then describe how we automatically infer this model from a given test suite. Finally, we present a technique that uses this inferred model to remove fine-grained redundancies by reorganizing test cases.

3.1.1 The Test Suite Model
There are a number of properties that our model needs to exhibit.
First, the model should capture how the test suite essentially tests the production code; this is important for preserving the behaviour of the test suite after any refactoring activity. Second, the model should capture dependencies at the test statement level to support test reorganization; since a test statement might have dependencies on previous statements, it is not possible to freely move test statements between test cases. Finally, the model should facilitate the discovery and removal of redundancies in test cases.

Figure 3.1 shows four test cases of the test class ComplexTest, adapted from Apache Commons Math [3], which test the add, subtract, multiply and divide functionality of the Complex class. We use Figure 3.1 as a running example.

We refer to each statement in a test case as a test statement (st). In this case, each test case is a sequence of test statements. For example, we refer to each line inside the test cases of Figure 3.1 as a test statement. Note that assertions are also a particular type of test statement.

A unit test case typically creates a set of variables (e.g., objects) and assigns values to their (member) variables, then it calls the production method under test using those variables as inputs, and finally it asserts the method's returned value. Our test model needs to capture these three entities, namely, variables and their values, production method calls, and test assertions.

Definition 1 (Variable Value).
The value of a variable x, Val(x), is defined as:

Val(x) =
    primitive_value                       : Type(x) ∈ P
    {(xi, Val(xi)) | xi ∈ Fields(x)}      : Type(x) ∉ P

Type(x) denotes the type of the variable x, P is the set of all primitive types, and Fields(x) denotes the set of all member variables of the object x. If the variable x is an object, its value is a set of (xi, Val(xi)) pairs, where xi is the name of the ith member variable in x; this includes private, protected and public member variables of the object in Java. Otherwise, if the variable is of a primitive type, Val(x) is the variable's primitive value. For example, the value of the object x in Figure 3.1 at line 5 is Val(x) = {(Complex.r, 3.0), (Complex.i, 4.0)}, given that the Complex class has two member variables of type double named r and i.

 1  @Test
 2  public void testAdd() {
 3    Complex x = new Complex(3.0, 4.0);
 4    Complex y = new Complex(5.0, 6.0);
 5    Complex z = x.add(y);
 6    assertEquals(8.0, z.getReal(), 1.0e-5);
 7  }
 8  @Test
 9  public void testSubtract() {
10    Complex x = new Complex(3.0, 4.0);
11    Complex y = new Complex(5.0, 6.0);
12    Complex z = x.subtract(y);
13    assertEquals(-2.0, z.getReal(), 1.0e-5);
14  }
15  @Test
16  public void testMultiply() {
17    Complex x = new Complex(3.0, 4.0);
18    Complex y = new Complex(5.0, 6.0);
19    Complex z = x.multiply(y);
20    assertEquals(-9.0, z.getReal(), 1.0e-5);
21  }
22  @Test
23  public void testDivide() {
24    Complex dividend = new Complex(3.0, 4.0);
25    Complex divisor = new Complex(5.0, 6.0);
26    Complex q;
27    q = dividend.divide(divisor);
28    assertEquals(39.0 / 61.0, q.getReal(), 1.0e-5);
29  }

Figure 3.1: Test cases from the Apache Commons project.

Figure 3.2: Test suite and production code interaction. The test suite calls Complex.add with the inputs {(Complex, {(Complex.r, 3.0), (Complex.i, 4.0)}), (Complex, {(Complex.r, 5.0), (Complex.i, 6.0)})}.

In order to preserve a test suite's behaviour, we look at the test suite as a black box and capture all the externally observable behaviours of the test suite.
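Definition 1 can be approximated with Java reflection. The following is a minimal, hypothetical sketch (class and method names are ours, not part of TESTMODLER) that computes Val(x) recursively; it ignores cyclic object graphs for brevity:

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of Definition 1: Val(x) is the primitive value for primitive
// (here: boxed/String) types, and the set of (field name, Val(field))
// pairs otherwise. Cyclic object graphs are not handled in this sketch.
public class VariableValue {
    static final java.util.Set<Class<?>> PRIMITIVES = java.util.Set.of(
        Integer.class, Long.class, Double.class, Float.class,
        Boolean.class, Character.class, Byte.class, Short.class, String.class);

    public static Object val(Object x) {
        if (x == null || PRIMITIVES.contains(x.getClass())) {
            return x; // primitive value
        }
        Map<String, Object> fields = new LinkedHashMap<>();
        for (Field f : x.getClass().getDeclaredFields()) {
            f.setAccessible(true); // include private and protected members
            try {
                fields.put(x.getClass().getSimpleName() + "." + f.getName(),
                           val(f.get(x)));
            } catch (IllegalAccessException e) {
                throw new RuntimeException(e);
            }
        }
        return fields;
    }

    // Stand-in for the Complex class of the running example.
    static class Complex {
        private final double r, i;
        Complex(double r, double i) { this.r = r; this.i = i; }
    }

    public static void main(String[] args) {
        // Val(x) for x = new Complex(3.0, 4.0), as at line 5 of Figure 3.1;
        // prints the (field name, value) pairs of x.
        System.out.println(val(new Complex(3.0, 4.0)));
    }
}
```
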
We refer to methods in the production code that are under test as production methods. The external behaviour of the test suite can be modelled by capturing the production methods that it calls, along with their inputs.

Definition 2 (Production Method Calls (PMC)). The Production Method Calls of a test statement, PMC(st), is the set of production methods that the statement calls, with their inputs. The PMC of a test statement is a set of (MethodNamei, InputSeti) pairs, in which MethodNamei is the called production method's qualified name, and InputSeti is the ordered set of (Type(xj), Val(xj)) pairs for each input variable xj of the method, starting with the this object for non-static member functions.

For example, the PMC for the test statement of line 5 in Figure 3.1 is {(Complex.add, [(Type(x), Val(x)), (Type(y), Val(y))])}, since as part of the test statement's execution, the method Complex.add is called with the two inputs x and y. Figure 3.2 illustrates the interaction between test code and production code for the execution of this test statement. Note that the PMC of a test statement that does not call any production methods is the empty set.

Further, our model needs to accommodate the ability to move test statements from a source position in one test case to a destination position, potentially in another test case. In order to preserve what the test statement does after the movement, we need to provide the same data and control dependencies the test statement had in the source position. If we know which variables are used as part of the execution of a test statement, we can determine whether we can safely move it to another destination position in the test suite.

Definition 3 (Used Variables of Test Statements (UVS)).
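A PMC entry from Definition 2 can be represented as a simple value object whose equality compares the qualified method name and the ordered inputs; this equality is what later makes test statements "equivalent" (Definition 9). The class below is an illustrative sketch with hypothetical names, abbreviating variable values as strings:

```java
import java.util.List;
import java.util.Objects;

// Sketch of Definition 2: a Production Method Call is the qualified method
// name plus the ordered (type, value) pairs of its inputs.
public class Pmc {
    final String methodName;          // e.g. "Complex.add"
    final List<List<String>> inputs;  // ordered (Type(x), Val(x)) pairs

    Pmc(String methodName, List<List<String>> inputs) {
        this.methodName = methodName;
        this.inputs = inputs;
    }

    // Two PMCs are equal iff they call the same method with the same inputs.
    @Override public boolean equals(Object o) {
        if (!(o instanceof Pmc)) return false;
        Pmc p = (Pmc) o;
        return methodName.equals(p.methodName) && inputs.equals(p.inputs);
    }
    @Override public int hashCode() { return Objects.hash(methodName, inputs); }

    public static void main(String[] args) {
        // Line 5 of Figure 3.1: Complex.add called on x=(3,4) with y=(5,6).
        Pmc a = new Pmc("Complex.add", List.of(
            List.of("Complex", "{r=3.0, i=4.0}"),
            List.of("Complex", "{r=5.0, i=6.0}")));
        Pmc b = new Pmc("Complex.add", List.of(
            List.of("Complex", "{r=3.0, i=4.0}"),
            List.of("Complex", "{r=5.0, i=6.0}")));
        System.out.println(a.equals(b)); // → true: same method, same inputs
    }
}
```
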
The Used Variables of a test statement, UVS(stj), is a set of (Type(xi), Val(xi)) pairs where each variable xi is used in the execution of stj.

For example, the used variables set of line 5 in Figure 3.1 contains {(Type(x), Val(x)), (Type(y), Val(y))}.

For assertions, we also keep track of the method calls that create the values of the variables. Since assertions check the output of particular production method calls, we need to capture this information as part of the test model. For example, in line 6 of Figure 3.1, the assertion checks the output of Complex.add with specific inputs. It is possible to retrieve the whole chain of method calls that the assertion checks in this way. For example, the assertion checks the output of Complex.add and, in turn, checks the output of two constructor calls of Complex.Complex with their inputs.

Definition 4 (Used Variables of Assertions (UVA)). The Used Variables of an assertion, UVA(as), is a set of (Type(xi), Val(xi), Meth(xi)) tuples where, as part of the assertion's execution, the variable xi is used and Meth(xi) is the PMC of the test statement that assigns the value of xi.

For instance, the UVA set of line 6 is {(Type(z), Val(z), Meth(z))}; in this case Meth(z) = {(Complex.add, [(Type(x), Val(x)), (Type(y), Val(y))])}, since variable z is used as part of the execution of the assertion and its value is created by the Complex.add production method call (the PMC of the test statement at line 5).

In addition to data dependencies, a test statement can have definition dependencies on its previous test statements. For example, in line 27 of Figure 3.1, the test statement does not depend on the value of q. We can replace the variable q with any other variable of type Complex, given that the data dependency of the test statement is satisfied. In this case, the test statement needs three variables of the type Complex defined in the previous test statements, in addition to its data dependencies, in order to be executed.
The Defined Variable Set (DV) of a test statement captures this definition dependency.

Definition 5 (Defined Variables (DV)). The Defined Variables of a test statement is a bag of the variable types that are referenced in the test statement and need to be defined in the previous test statements.

Consider Figure 3.1 again. The defined variable set of line 27 is {Complex, Complex, Complex}, since the variables dividend, divisor and q need to have been defined for the test statement.

To perform data and definition dependency analysis, we maintain a test state.

Definition 6 (Test State). A Test State encompasses information regarding the defined variables, their values, and the PMCs that created those values at a specific test statement in the test case. Formally, the Test State Sj is a set of (Type(xi), Val(xi), Meth(xi)) tuples for each variable xi referable from the jth test statement in the test case.

In the Java programming language and the JUnit testing framework, the test state includes information about local variables, static fields of loaded classes, and member variables of the test class. For example, in Figure 3.1, the test state before the execution of line 5 is {(Type(x), Val(x), Meth(x)), (Type(y), Val(y), Meth(y))}, since the two variables x and y are referable at line 5. Meth(x) = {(Complex.Complex, {(double, 3.0), (double, 4.0)})}, since x is created by the production method call of the constructor Complex.Complex with the two input values 3.0 and 4.0 of type double.

We do not consider a variable's identity (such as its memory address) or name as part of the test state, since most tests do not depend on an object's identity or on variable names in the test code. However, field names in objects are part of the test state, since those variables are defined in the production code.

It is possible to move a test statement only if the test statement is compatible with the test state at the destination position.

Definition 7 (Compatible States).
A test state is compatible with a test statement if it satisfies the test statement's data and definition dependencies. In this case, the test statement can be executed on the test state while preserving its behaviour. Formally, a test state Si is compatible with a test statement stj iff its used variables UVS(stj) and defined variables DV(stj) are subsets of the test state: (UVS(stj) ⊆ Si) ∧ (DV(stj) ⊆ Def(Si)), where Def(Si) denotes the set of defined variables in the test state Si.

Note that the compatibility relation for an assertion is defined similarly.

Now that we have all the required information, we can define a test suite model that enables our fine-grained test analysis at the test state and test statement levels.

Definition 8 (Test Suite Model). A Test Suite Model is a directed graph denoted by <r, V, E>, where V is a set of vertices representing test states, E is a set of directed edges representing test statements and assertions, and r denotes the root of the graph, which is the initial empty state.

Figure 3.3 depicts the test suite model of the test suite of Figure 3.1. Ovals illustrate test states and rectangles illustrate labels of test statement edges. Dotted lines represent the state compatibility relations between test states and test statements. Note that we have illustrated only a subset of the compatibility relations to avoid cluttering the graph. For example, the statement of line 12 is compatible with the states S2, S3, S4, S5, S6 and S7, since the used variables set at line 12, {(Complex, {(Complex.r, 3.0), (Complex.i, 4.0)}), (Complex, {(Complex.r, 5.0), (Complex.i, 6.0)})}, is a subset of these test states. With the notion of compatible states, we can determine possible valid reorganizations of test statements in test cases.
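Definition 7 reduces to two containment checks, which can be sketched as follows (hypothetical names; variables are abbreviated as (type, value) string pairs, and the bag comparison of Definition 5 is done with multiplicity counts):

```java
import java.util.*;

// Sketch of Definition 7: a test state is compatible with a test statement
// iff the statement's used (type, value) pairs appear in the state and the
// state defines at least as many variables of each required type.
public class Compatibility {
    // usedVars = UVS(st); definedTypes = DV(st), a bag of type names;
    // state = S, a set of (type, value) pairs.
    public static boolean compatible(Set<List<String>> usedVars,
                                     List<String> definedTypes,
                                     Set<List<String>> state) {
        if (!state.containsAll(usedVars)) return false; // UVS(st) ⊆ S
        Map<String, Integer> avail = new HashMap<>();
        for (List<String> v : state) avail.merge(v.get(0), 1, Integer::sum);
        Map<String, Integer> need = new HashMap<>();
        for (String t : definedTypes) need.merge(t, 1, Integer::sum);
        // DV(st) ⊆ Def(S), counted with multiplicity (a bag, per Definition 5)
        for (Map.Entry<String, Integer> e : need.entrySet())
            if (avail.getOrDefault(e.getKey(), 0) < e.getValue()) return false;
        return true;
    }

    public static void main(String[] args) {
        // State S2 of the running example: x=(3,4) and y=(5,6) are defined.
        Set<List<String>> s2 = Set.of(
            List.of("Complex", "(3.0,4.0)"), List.of("Complex", "(5.0,6.0)"));
        // Line 12 (z = x.subtract(y)) uses both values and needs two Complex
        // variables defined previously; it is therefore compatible with S2.
        System.out.println(compatible(s2, List.of("Complex", "Complex"), s2)); // → true
    }
}
```
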
For example, we can relocate the test statement of line 12 in Figure 3.3 to the location after S3, S4, S5, S6 or S7.

Figure 3.3: Extracted partial model for the running example. Ovals illustrate test states and rectangles illustrate labels of test statement edges. Dotted lines represent the state compatibility relations between test states and test statements.

To detect redundancies in the test suite, we look at the external behaviour of each test statement to detect those that have identical external behaviour.
These equivalent test statements are identical as far as testing the production code is concerned.

Definition 9 (Equivalent Test Statements). Equivalent Test Statements are the sets of test statements that have the same production method calls (PMC) set.

To preserve the coverage of the production code, we need to call at least one of the test statements in each set of equivalent test statements. For example, in Figure 3.1, the sets of equivalent test statements are {{st3, st10, st17, st24}, {st4, st11, st18, st25}, {st5}, {st6}, {st12}, {st13}, {st19}, {st20}, {st27}, {st28}}. In this case, we only need to include one of the test statements from the set {st3, st10, st17, st24} to maintain the coverage.

3.1.2 Inferring the Model

We now describe how we create the model given a test suite. Figure 3.4 shows the overview of our approach.

Figure 3.4: Overview of our approach. The test and production code are instrumented and the test cases are run to obtain traces of production method calls and test states; from these we identify used variables, compatible states and equivalent test statements, create the test suite graph, reorganize the test cases, and compose the reorganized test suite.

Equivalent Test Statements. To capture the test state at each test statement, we store the type and value of all referable variables through code instrumentation. This includes all local variables, member variables of the test class, and static fields of loaded classes. To capture production method calls (PMCs), we instrument the production code to log the entry point, input values (including the this object for non-static methods), and exit point of each method in the production code. This way, we can trace the call stack for each method.
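Grouping test statements into equivalence classes by their PMC set (Definition 9) amounts to building a map from PMC sets to statement lists. A sketch with hypothetical names, abbreviating PMCs as strings:

```java
import java.util.*;

// Sketch of Definition 9: statements with the same PMC set fall into the
// same equivalence class. PMCs are represented as opaque strings here.
public class EquivalenceClasses {
    public static Collection<List<String>> group(Map<String, Set<String>> pmcOf) {
        // Key each statement by its PMC set; equal sets land in one class.
        Map<Set<String>, List<String>> classes = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : pmcOf.entrySet())
            classes.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                   .add(e.getKey());
        return classes.values();
    }

    public static void main(String[] args) {
        // Lines 3 and 10 of Figure 3.1 call the same constructor with the
        // same inputs, so st3 and st10 form one equivalence class.
        Map<String, Set<String>> pmcOf = new LinkedHashMap<>();
        pmcOf.put("st3",  Set.of("Complex.Complex(3.0,4.0)"));
        pmcOf.put("st10", Set.of("Complex.Complex(3.0,4.0)"));
        pmcOf.put("st5",  Set.of("Complex.add(x,y)"));
        System.out.println(group(pmcOf)); // → [[st3, st10], [st5]]
    }
}
```
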
Methods whose call stacks contain no other production methods are those that are called directly by the test code. This enables us to capture the production methods that are directly called (with their input values) for each test statement. We execute the instrumented test cases against the instrumented production code, and use the traces to compute the sets of equivalent test statements that have the same PMC.

Used and Defined Variables. For each method invocation, we assume all the input variables and all of their properties are used as part of the test statement's execution. This also includes all referable variables, such as static variables and member variables of an object that are part of a method invocation on the object. For example, in line 6 of Figure 3.1, we assume that all of the properties of the variable z (i.e., z.r and z.i) will be used as part of the test statement's execution, although in this case z.i is not actually used. This is a conservative assumption in terms of detecting compatible states for test statements: we might not be able to detect some compatibility relations, but all the relations that we do detect are correct. To compute the defined variables set, we check the types of the variables that are referenced in the test statement.

Compatible States. To compute the compatible states for a test statement st, we check the states in which the variables used in st have the same value. We also check whether the test states satisfy the test statement's definition dependency. For assertions, we additionally check for the PMC that defined the most recent value of the used variables. Although it is possible to make sure that the whole chain of method calls that an assertion checks remains the same, in this work, we only require that the direct method calls that an assertion checks remain the same.
The reason behind this decision is that constraining a deeper level of method calls would restrict our options for reorganizing test cases.

 1  @Test
 2  public void testAdd_Subtract_Multiply_Divide() {
 3    Complex x = new Complex(3.0, 4.0);
 4    Complex y = new Complex(5.0, 6.0);
 5    Complex z = x.add(y);
 6    assertEquals(8.0, z.getReal(), 1.0e-5);
 7    Complex z_subtract = x.subtract(y);
 8    assertEquals(-2.0, z_subtract.getReal(), 1.0e-5);
 9    Complex z_multiply = x.multiply(y);
10    assertEquals(-9.0, z_multiply.getReal(), 1.0e-5);
11    Complex q = x.divide(y);
12    assertEquals(39.0 / 61.0, q.getReal(), 1.0e-5);
13  }

Figure 3.5: Test suite after reorganization.

Reducing Redundancy in the Test Suite. We use the inferred model to identify and remove redundant test statements. For example, by reorganizing the four test cases of Figure 3.1, we can create the reduced test case shown in Figure 3.5, which has the same coverage with six fewer statements.

To maintain the test suite coverage, we basically need to call each production method once. Each test case in our test model is a path starting from the initial state. For example, in Figure 3.3, the test testAdd is the path (st3, st4, st5, st6), in which sti is the ith test statement edge in Figure 3.3. Thus, to maintain the test suite coverage, we need to find a set of paths, starting from the initial state, that visits at least one test statement from each set of equivalent test statements. To find such paths, we propose a greedy algorithm.

Identifying Clusters of Redundant Test Cases. The test suite model for the whole test suite can become large (e.g., the graph for Apache Commons has 22K nodes). Since we are interested in reorganizing test cases that share some common equivalent test statements, we can construct a test suite model for each set of test cases that share at least k equivalent test statements, in which k is a cutoff value.
To that end, we create a new cluster graph in which nodes are test cases and there is an edge between two nodes if there are at least k common test statements between the two test cases. Then, we create a test suite graph for each connected component of this cluster graph and perform the reorganization on each of these test suite graphs independently. For example, in Figure 3.6, nodes are test cases and edges are weighted by the number of common equivalent test statements the two nodes contain. With a cutoff value of k = 10, we get three clusters, namely {T1, T2, T4}, {T3, T6}, and {T7}.

Figure 3.6: Clustering test cases for reorganizing.

Reorganization Algorithm. The intuition behind our algorithm, shown in Algorithm 1, is to extend a path to cover as many unique test statements and assertions as possible. In this step, we reorganize and merge the test cases in each cluster. We maintain a set of equivalent test statements and assertions that we need to cover (uncoveredEqStmts). Each test statement operates on a compatible test state and transforms it into another test state. If Si and Si+1 are the test states before and after the execution of the test statement sti, the function apply(sti, Si) = Si+1 applies the effect of executing the test statement sti on the test state Si and returns the changed test state. We have the test state before and after the execution of each test statement. We assume that the test statement could potentially change all of its used variables, so we can compute the effect of running the test statement on each of its compatible states. Essentially, we need to update the values of the used variables of the compatible test state to the values of the variables in Si+1. If Sj is a compatible state of the test statement sti, then apply(sti, Sj) = updateValues(UVS(sti), Si+1, Sj).
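The clustering step can be sketched with a union-find over test cases, connecting pairs that share at least k equivalent statements and taking connected components. All names below are hypothetical, and the edge weights are assumptions chosen to match the clusters described for Figure 3.6:

```java
import java.util.*;

// Sketch of the clustering step: connect two test cases when they share at
// least k equivalent test statements, then take connected components.
public class TestClusters {
    private final Map<String, String> parent = new HashMap<>();

    String find(String x) {
        parent.putIfAbsent(x, x);
        if (!parent.get(x).equals(x)) parent.put(x, find(parent.get(x))); // path compression
        return parent.get(x);
    }
    void union(String a, String b) { parent.put(find(a), find(b)); }

    // shared maps a pair of test cases to their number of common
    // equivalent statements; k is the cutoff value.
    public Map<String, List<String>> cluster(Map<List<String>, Integer> shared,
                                             int k, List<String> tests) {
        for (String t : tests) find(t);
        for (Map.Entry<List<String>, Integer> e : shared.entrySet())
            if (e.getValue() >= k) union(e.getKey().get(0), e.getKey().get(1));
        Map<String, List<String>> comps = new TreeMap<>();
        for (String t : tests)
            comps.computeIfAbsent(find(t), r -> new ArrayList<>()).add(t);
        return comps;
    }

    public static void main(String[] args) {
        // Assumed weights consistent with Figure 3.6; with k = 10 the text
        // expects the clusters {T1, T2, T4}, {T3, T6} and {T7}.
        Map<List<String>, Integer> shared = new HashMap<>();
        shared.put(List.of("T1", "T2"), 10);
        shared.put(List.of("T1", "T4"), 12);
        shared.put(List.of("T3", "T6"), 15);
        shared.put(List.of("T2", "T3"), 5); // below cutoff, ignored
        Map<String, List<String>> c = new TestClusters()
            .cluster(shared, 10, List.of("T1", "T2", "T3", "T4", "T6", "T7"));
        System.out.println(c.size()); // → 3 clusters
    }
}
```
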
We maintain a running state for the reorganized test case so that, at each point, we know what the test state would be at the end of the reorganized test case up to this point (runningState). We update this running state at each iteration of the algorithm (Line 9).

Algorithm 1: Reorganization

input:  set of uncovered equivalent test statements and set of test cases in a connected component
output: set of paths that visits at least one test statement in each set of equivalent test
        statements, and all assertions, while minimizing the number of test statements visited

reorganize(uncoveredEqStmts, graph):
 1  paths ← ∅
 2  do
 3      first ← graph.get(init)
 4      runningState ← ∅
 5      first, path ← pathToNearestUncovered(graph, first, uncoveredEqStmts, runningState)
 6      frontier ← first
 7      while frontier != null do
 8          frontier, newPath ← pathToNearestUncovered(graph, frontier, uncoveredEqStmts, runningState)
 9          runningState ← updateState(newPath)
10          updateGraph(graph, frontier, runningState)
11          markAsCovered(frontier, uncoveredEqStmts)
12          path.add(newPath)
13      paths.add(path)
    while first != null

With the information from the test states, we update the graph to include possibly new compatibility edges (Line 10). We start from the initial state and find the nearest test statement node that we still need to cover. We find the shortest path from the initial state to that node, and do the same from that node until we have covered at least one statement from each set of the equivalent test statements and all the assertions (Line 7). To find the shortest path, we use a variant of the best-fit search algorithm that also maintains the running state. We maintain the running state for each path that is being examined during the algorithm. For example, for the test suite model of Figure 3.3, our algorithm returns the path (st3, st4, st5, st6, st12, st13, st19, st20, st27, st28). This path is highlighted (thick line) in Figure 3.3.

Composing Reduced Test Cases.
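Ignoring state compatibility and path structure, the greedy intuition of Algorithm 1 resembles greedy set cover over candidate paths: repeatedly pick the candidate covering the most still-uncovered equivalence classes. A simplified, hypothetical sketch:

```java
import java.util.*;

// Simplified illustration of the greedy intuition behind Algorithm 1:
// repeatedly choose the candidate path that covers the most uncovered
// equivalence classes (state compatibility is ignored for brevity).
public class GreedyCover {
    public static List<String> cover(Map<String, Set<String>> paths,
                                     Set<String> uncovered) {
        Set<String> remaining = new HashSet<>(uncovered);
        List<String> chosen = new ArrayList<>();
        while (!remaining.isEmpty()) {
            String best = null;
            int gain = 0;
            for (Map.Entry<String, Set<String>> e : paths.entrySet()) {
                Set<String> g = new HashSet<>(e.getValue());
                g.retainAll(remaining); // newly covered classes
                if (g.size() > gain) { gain = g.size(); best = e.getKey(); }
            }
            if (best == null) break; // some classes unreachable
            chosen.add(best);
            remaining.removeAll(paths.get(best));
        }
        return chosen;
    }

    public static void main(String[] args) {
        // Candidate paths labelled by the equivalence classes they cover.
        Map<String, Set<String>> paths = new LinkedHashMap<>();
        paths.put("p1", Set.of("new(3,4)", "new(5,6)", "add"));
        paths.put("p2", Set.of("new(3,4)", "new(5,6)", "subtract"));
        paths.put("p3", Set.of("add", "subtract"));
        System.out.println(cover(paths, new HashSet<>(
            Set.of("new(3,4)", "new(5,6)", "add", "subtract")))); // → [p1, p2]
    }
}
```
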
Algorithm 1 gives us a set of paths that minimizes the number of test statements executed while maintaining the test suite's coverage. Since each test statement in a path can originate from a different test case, variables with the same value can have different names. For example, in Figure 3.1, the variables x and y are named dividend and divisor in testDivide. Therefore, to generate the reduced test cases, we might need to rename the variables. Test statements can also define variables that were defined with the same name previously; for example, in the reorganized test case, the variable z is defined in each test case, so we need to rename variable definitions as well. Also, test statements can use member variables and member functions of the source test class, so we need to include those in the destination test class as well. Because of polymorphism in object-oriented programs, we also need to cast a variable to its sub- or superclass if the static type of the variables with the same value differs between the source and destination states. Algorithm 2 shows the pseudocode for the algorithm responsible for composing a reorganized test case. We use a bidirectional map from variable values to variable names and their types to maintain the state. As we go through the test statements in the reorganized test case path, for each test statement, we check whether we have the value for each variable in the test statement; if there is a value in the state but with a different name, we rename the variable in the test statement to the name of the variable in the state (Lines 6-11). If the type of the variable is different, we cast the variable to the destination type. We also check the variable definitions for name duplicates and rename the duplicates (Lines 12-16). Finally, we need to update the bidirectional state map with the changed values from the test statement's execution (Line 17).

Preserving Test Suite Behaviour.
Assume that we reorganize a set of test cases x into the reorganized set of test cases y; we show that y preserves the fault-revealing behaviour of x. PMC(x) denotes the set of production method calls that the set of test cases x calls, with their inputs. Since PMC(x) = PMC(y), each production method mi that is called as part of the execution of x will be called with the same inputs in y. Hence, we preserve the coverage and thus the soft oracles of x. We also preserve the hard oracles of x, because our approach includes all the corresponding test assertions in the reorganized test cases. Assume that in x, assertion asi checks the return value of the production method mi with the input ini. Let asj be the same assertion asi included in y. Since UVA(asi) = UVA(asj), assertion asj will check the return value of mi with the same input ini. If a fault f in mi affects the return value of mi(ini) and is detectable by asi, it is also detectable by asj.

Implementation. We implemented our approach in a tool called TESTMODLER, which is publicly available [11]. The tool is written in Java; although TESTMODLER only supports Java programs with JUnit tests, our approach is applicable to other programming languages and testing frameworks.
Algorithm 2: Test Case Composition

input:  ordered list of statements in the composed test case
output: renamed, compilable list of statements for the composed test case

renameStatements(statementsPath):
 1  stateBiMap<Value, Set<Name, Type>> ← ∅
 2  foreach stmt ∈ statementsPath do
 3      nameValueMapPreq<Name, Value> ← getNameValuePreqVarsInStatement(stmt)
 4      renameMap<OldName, NewName> ← ∅
 5      castMap<varName, oldType, newType> ← ∅
 6      foreach (varName, varValue) ∈ nameValueMapPreq do
 7          varNamesInState ← stateBiMap[varValue]
 8          if varName ∉ varNamesInState then
 9              renameMap[varName] ← pickName(varNamesInState)
10          castMap ← checkForTypes(stateBiMap, stmt)
11      leftHandSideVars<Name> ← getVarsInLeftHandSide(stmt)
12      foreach varName ∈ leftHandSideVars do
13          varNameInState ← stateBiMap[varName]
14          if varNameInState != null then
15              renameMap[varName] ← generateNewName(varName)
16      stmt ← renameStatement(stmt, renameMap, castMap)
17      updateStateMap(stateBiMap, stmt)

It gets as input the path to a Java project, instruments the test and production code, and runs the instrumented test code against the instrumented production code to obtain traces. TESTMODLER reorganizes the partly redundant test cases, i.e.,
test cases that have at least one redundant test statement, and generates a new reduced test suite.

3.2 Evaluation

To assess the real-world relevance and efficacy of our approach, we address the following research questions:

RQ1: How prevalent are partly-redundant tests in practice?
RQ2: Can TESTMODLER reduce redundancy and test execution time while preserving the test suite coverage?
RQ3: What is the performance of running TESTMODLER?

Table 3.1: Subject systems and their characteristics.

Subject System        Production Code (KLOC)  Test Code (KLOC)  #Tests (Static)  #Tests (Dynamic)
Commons Math          45.2                    59.1              3,990            6,174
Commons Collections   12.3                    20.3              1,264            16,069
AssertJ               6.4                     24.7              4,620            5,802
CheckStyle            16.6                    27.0              1,865            1,956
Total                 80.5                    131.1             11,739           30,001

3.2.1 Subject Systems

We selected four open-source Java programs that have JUnit test cases. Table 3.1 shows our subject systems and their characteristics. Apache Commons Math is a light-weight mathematics and statistics library [3]. Apache Commons Collections is a library providing more powerful data structures for the Java programming language [2]. AssertJ is a library that provides richly typed, easy-to-use assertions [4]. CheckStyle is a static analysis tool that helps developers enforce a coding standard [5]. Note that the number of actual written static test cases in a test suite might differ from the number of test cases that are dynamically executed. For example, testers can create a common abstract test class with test cases and inherit from this abstract test class to test the production code under different input data and scenarios. Another example is JUnit 4 Parameterized test classes, through which testers can run multiple tests with different input data using the same test class/methods.

3.2.2 Procedure and Results

Prevalence (RQ1). To answer RQ1, we measure the number of test cases that have at least one common equivalent test statement with another test case.
To do this, we use our test suite model to identify the classes of equivalent test statements in the test suite (see Section 3.1.2). Table 3.2 shows the number of partly-redundant tests, the number of clusters of test cases that have at least one common test statement (Section 3.1.2), the total number of common test statements in the whole test suite, the total number of unique classes of equivalent test statements, and the number of redundant test statements in the test suite.

Table 3.2: Partial redundancy in the test suites.

Subject System        #redundant tests  #clusters  #common test stmts  #unique test stmts  #redundant test stmts
Commons Math          1,354             235        4,052               1,305               2,747
Commons Collections   472               122        1,856               629                 1,227
AssertJ               947               131        516                 178                 338
CheckStyle            164               55         258                 96                  162
Total                 2,937             543        6,682               2,208               4,474

Figure 3.7c shows the distribution of the number of redundant test statements in each test case, for those test cases that have at least one redundant test statement. Figure 3.7b shows the distribution of the number of test cases in each redundancy cluster, and Figure 3.7a shows the distribution of the number of test statements in each equivalent test statements set.

Finding 11: Our results show that 2,937 (25%) out of the total number of 11,739 test cases in the test suites of our subject systems are partly redundant.
This means that it is possible to reduce the execution time of 25% of the test cases by removing redundancies and repeated production method calls.

Table 3.3: Test reduction.

Subject System        #tests reorganized  #tests reorganized into  #tests reduced  #statements reduced
Commons Math          1,149               203                      946             1,623
Commons Collections   186                 62                       124             250
AssertJ               392                 139                      253             207
CheckStyle            124                 47                       77              122
Total                 1,851               451                      1,400           2,202

Figure 3.7: Distribution of redundant statements and test cases in the test suite. (a) Number of common test statements in each equivalency class. (b) Number of test cases in each cluster of test cases. (c) Number of common test statements in each test case.

Effectiveness (RQ2). To assess the efficacy of our approach in reducing fine-grained redundancies in test cases, we ran TESTMODLER on the subject systems. TESTMODLER reorganizes redundant test cases to avoid repeated production method calls, which reduces the number of redundant test statements. We measured the number of partly-redundant test cases and test statements before and after running the tool on the test suite. Table 3.3 presents the number of redundant test cases that were reorganized, the number of test cases that TESTMODLER reorganized these test cases into, the number of test cases reduced, and the number of redundant test statements reduced. Note that in some cases we need to reorganize test cases that are in the same test cluster into more than one test case. For example, for the subject system AssertJ, we reorganized 947 test cases that belonged to 131 clusters into 139 test cases.
In these cases, some test cases change the value of a member variable of the test class, while other test cases depend on the old value of that variable; although TESTMODLER tries to minimize the number of reorganized test cases, it is then not possible to reorganize all the test cases in the cluster into one test case. In other cases, some test classes use a custom runner, and we only reorganize together the test cases that have the same test runner (see Section 3.3.3).

Figure 3.8a and Figure 3.8b compare the actual number of redundant test cases and statements that exist in the test suites of our subject systems with the number of test cases and statements that our approach was able to reduce.

Finding 12: In total, TESTMODLER reorganized 1,851 out of 2,937 (63%) partly-redundant test cases, which removed 2,202 out of 4,474 (49%) redundant test statements.

To assess the effects of reducing redundancy on test suite execution time, we measure the execution time of individual test cases. Non-deterministic test cases can have variable execution times. In some cases, such test cases retry several times until they pass, which can affect the measured test suite run time.

To mitigate this variability and compare the execution time of the tests before and after reorganization, we measure the execution time of individual test cases before and after reorganization. We sum up the execution times of the test cases that TESTMODLER reorganized, before and after reorganizing the test suite. This way, we obtain more stable test execution results. We perform the measurements 10 times and report the averages. Table 3.4 shows the execution time of the whole test suite, the execution time of the test cases that TESTMODLER reorganized, and the execution time
We measured the statement coverage and branch coverage, using EclEmma 2.3.3, for the test suite before and after reorganization to assess whether TESTMODLER preserves the statement and branch coverage. As shown in Table 3.5, TESTMODLER preserves the statement and branch coverage for all the subject systems. Note that since the subject systems Apache Commons Math and Collections have many non-deterministic tests, the coverage varies from run to run, and the coverage of the reorganized test suite differs slightly from the original test suite for these subject systems.

Finding 13: In total, TESTMODLER reduced the execution time of our four subject systems by 2.4%, while maintaining the coverage.

Table 3.4: TESTMODLER's effectiveness in terms of execution time.

Subject System        Execution Time (Total)  Execution Time (reorganized)  Execution Time Reduction  % reduced (Total)  % reduced (reorganized)
Commons Math                  83.1                    17.1                        2.1                     2.5                 12.3
Commons Collections           16.1                     4.8                        0.37                    2.3                  7.7
AssertJ                        3.4                     0.23                       0.04                    1.2                 17.4
CheckStyle                    26.3                     3.1                        0.6                     2.4                 20.1
Total                        128.9                    25.23                      3.13                    2.4                 12.4

Performance (RQ3). To assess the performance of TESTMODLER, we measured the execution time of the different steps of running the tool. Table 3.6 shows the execution time for the instrumentation step, for running the instrumented test suite against the instrumented production code, and for the reorganization and recomposition algorithms. On average, it takes TESTMODLER 1,027 milliseconds to reorganize and recompose each test case. Taking into account all three steps for running the tool, on average it

¹All measurements are performed on a Mac OS X machine, running on a 2.7GHz Intel Core i5 with 8 GB of memory.
Values reported are in seconds.

Figure 3.8: Comparison of optimal and actual reductions for the number of test cases and test statements. (a) Number of partly-redundant test cases reorganized. (b) Number of redundant statements reduced.

takes 1,933 milliseconds for each redundant test case to be identified, reorganized, and recomposed.

Finding 14: On average, TESTMODLER takes 1,026 milliseconds to reorganize each partly-redundant test case.

Table 3.5: Test suite's coverage before and after reorganization.

Subject System        Statement Coverage Before (%)  Branch Coverage Before (%)  Statement Coverage After (%)  Branch Coverage After (%)
Commons Math                  92.73                        85.72                        92.71                        85.71
Commons Collections           84.46                        77.27                        84.43                        77.23
AssertJ                       95.56                        92.19                        95.56                        92.19
CheckStyle                    95.44                        96.82                        95.44                        96.82

Table 3.6: TESTMODLER's performance.

Subject System        Instrumentation (m:s)  Running instrumented code (m:s)  Reorganization (m:s)  Reorganization per test case (ms)  Total time per test case (ms)
Commons Math                 1:41                   6:41                          18:23                     960                            1,398
Commons Collections          0:40                   0:58                           3:18                   1,062                            1,588
AssertJ                      4:04                   0:37                           6:23                     976                            1,693
CheckStyle                   2:52                  10:25                           3:36                   1,749                            8,174
Total/Average                9:17                  18:41                          31:40                   1,027                            1,933

3.3 Discussion

3.3.1 Applications

Our technique can be run offline to reduce the redundancy in the test suite. For projects with large (and slow) test suites, the reduction in test number and execution time can be especially beneficial.

Our tool keeps both the original and reduced versions of the test suite and links them together. This enables tracing back any failed reorganized test case to the corresponding test case in the original test suite. Since the reorganization might reduce the readability of the test suite, the reorganized test suite can be used for running test cases and regularly kept up to date with the original test suite.
An important feature is that the reorganized test suite does not need to be updated if a commit makes no changes to the reorganized tests. However, if parts of a reorganized test case change, the changed test case needs to be analyzed by TESTMODLER again. As discussed in RQ3, this whole process takes less than two seconds per test case.

Our test suite model can also be used for other test analysis activities. For instance, it can be used to analyze tests for detecting potential bugs [52] or smells [53]. One of the tasks performed during test refactoring is to reorganize test cases to remove eager and lazy test smells [53]; our model can help with this refactoring task, since it is not straightforward to manually reorganize test cases in a way that preserves the behaviour of the test suite. Our model can be used to identify test cases that have common test statements and are small, and merge them to remove the lazy test smell. It can also be used to reorganize large test cases into smaller ones, to remove the eager test smell while keeping the incurred redundancy at a minimum.

3.3.2 Relation to test suite reduction techniques

In our approach, we consider two test statements equivalent if they call exactly the same production methods with exactly the same inputs. However, it is possible to use different criteria to identify equivalent test statements. For example, an alternative is to compute the production method coverage of each test statement and consider test statements with exactly the same coverage to be equivalent. By using coverage information, our approach can act like a fine-grained test reduction technique. Existing test suite reduction techniques remove a test case that has the same coverage as other tests; our approach, on the other hand, removes only the redundant parts of test cases.

3.3.3 Limitations

We investigated why our tool cannot reduce all redundant test cases.
First, we cannot reorganize test cases that terminate abruptly. For example, test cases with JUnit's @Test(expected=SomeException.class) annotation throw an exception and terminate abruptly. We have also seen test cases that try to achieve similar behaviour with the use of return and fail statements. Although the use of inheritance in test code is debatable [37], all of our subject systems use inheritance in their test code heavily. We chose not to reorganize test cases in test classes that are subclassed by another test class, since in this case the subclass might override some of the test cases and render the reorganized test cases useless. TESTMODLER cannot reorganize a test case inside a parameterized test class with test cases of other classes, since in this case the test case will be run with different inputs and can only be merged with another test case that has the same inputs. Some test classes also use custom test runners to run their test cases. For example, in one case, test cases would be retried several times with a custom runner until they pass. In this case, we can only reorganize together test classes that have the same custom runner. We also chose not to reorganize test statements inside conditional constructs such as for-loops, try-catch blocks, and if statements. Further, since we do not store a variable's identity as part of our test state (Definition 6), we do not support reorganizing test cases with assertSame assertions. Since we depend on dynamic values of variables in test cases, we do not support reorganizing non-deterministic test cases.

We used the test statement as the smallest unit of computation for the test model; using a smaller unit, such as a bytecode operation, increases the model size and the algorithm complexity.
On the other hand, using a larger unit, such as blocks of statements, decreases our granularity in reorganizing test cases and detecting partly-redundant test cases with common test statements.

3.3.4 Threats to validity

Similar to any other experiment using a limited number of subject systems, an external validity threat to our results is their generalizability. We tried to mitigate this threat by choosing subject systems with various sizes, domains, and tests, although we need more subject systems to fully address the generalization threat. With respect to reproducibility of our results, the source code of our tool and all subject systems is available online [11], making the experiments reproducible.

3.4 Conclusions

In this chapter, we proposed a test suite model that facilitates test code analysis at the test statement level. We used the proposed model to present an automated technique and tool, called TESTMODLER, for reducing fine-grained redundancies in test cases, while preserving the behaviour of the test suite. We empirically evaluated our technique on four subject systems; overall, TESTMODLER was able to reduce the number of partly-redundant test cases by up to 85% and test execution time by 2.5%, while preserving the original test suite coverage and production method call behaviour.

Chapter 4

Related Work

Empirical Bugs and Smells studies. Test smells were first studied by van Deursen et al. [53]; later works defined further types of test smells, such as test fixture [29], eager test, and mystery guest, and proposed methods to detect these test smells [13, 35, 54, 55]. Test smells are, however, not bugs. In this study, we focus on bugs that change the intended behaviour of the test.

Zhang et al. [65] found that the test independence assumption does not always hold in practice. They observed that the majority of dependent tests result in false alarms and some of these dependencies result in missed alarms.
In this case, a test that should reveal a fault passes accidentally because of the environment generated by another, dependent test case. Test dependency (I1) is one of the 16 cause subcategories for test bugs that emerged in our empirical study.

Lu et al. [41] studied real-world concurrency bugs, and found that most concurrency bugs belong to order or atomicity violations. Luo et al. [42] categorized and investigated the root causes of failures in test cases manifested by non-determinism, known as flaky tests. Flaky tests are one of the main cause categories that emerged in our categorization study. Our results are in line with the findings of Luo et al. in terms of the root causes of such test bugs.

Li et al. [40] mined software bug repositories to categorize types of bugs found in production code. Their work is similar to ours in terms of categorization, but we examined and categorized types of bugs in test code instead of production code.

Herzig and Nagappan [34] proposed an approach to identify false alarms. They use association rule learning to automatically identify these false alarms based on patterns learned from failing test steps in test cases that lead to a false alarm. The authors aim at identifying test alarms to prevent development process disruption, since a test failure halts the integration process on the code branch where the test failure occurred. Our work, however, aims at providing insights into patterns of faults in test code to help detect them by static analysis tools.

Test quality. Athanasiou et al. [12] proposed a model to assess test quality based on source code metrics. They showed that there is a high correlation between the test quality as assessed by their model and issue handling performance. Zaidman et al. [63] investigated how production code and test code co-evolve. They introduced three test co-evolution views, namely the change history view, the growth history view, and the test quality evolution view.
It would be interesting to see how test bugs would fit into these views; for instance, are test bugs introduced when the tests are first added, or when they are modified later as test code co-evolves with production code?

Test Refactoring. Fang et al. [23] used assertion fingerprints to detect similar test cases that can be refactored into one single test case. They performed static analysis on test code and, by analyzing the CFG, computed the branch count, merge count, and exceptional successor count for each assertion. Based on these attributes, they detect refactoring-candidate test cases. Unlike their approach, ours finds refactoring candidates based on the common redundant statements that test cases share.

Guerra et al. [30] visually represent test cases with a graphical notation before and after test refactoring to help developers verify that the behaviour of the test case has been kept unchanged. Our approach, on the other hand, ensures that reorganizing test statements in the test suite preserves its behaviour.

Xuan et al. [60] split test cases into smaller fragments to enhance dynamic analyses. Xuan and Monperrus [59] perform test purification to improve fault localization. They decompose failing test cases with several assertions into multiple test cases in which each test case has only one assertion. They perform test purification in three steps, namely test case atomization, test case slicing, and refinement. First, they generate k copies of a test case that has k assertions and disable all assertions but one in each copy. Then they perform dynamic slicing and slice each test case from the broken assertion statement in the test and its variables. By this approach, they eliminate the unnecessary statements from the purified test cases. Finally, they rank the suspicious statements with the purified test cases, which improves the accuracy of fault localization techniques.

Devaki et al. [19] merge web application GUI test cases to reduce test execution time.
They use a combination of the browser's DOM state and the database state to define the state of the program. They do not merge the test statements and assertions of test cases; instead, they focus on merging test steps. Each test case in their approach is comprised of several test steps, and each test step is an event exercised on GUI elements. Unlike our approach, which can reorganize and interleave all valid refactorings of a unit test case, their approach can only interleave chunks of test steps that result in the same browser DOM state. Fraser and Wotawa [24] merge test cases generated by a model checker; they compare the state of the application for different tests and merge only those test cases where one test case is a prefix of the other. Our approach, on the other hand, can reorganize all valid refactorings of test cases, and as opposed to their approach, which operates on models, our approach operates on real code and unit tests.

Our test state representation is closely related to the heap representation of [31]. They store the portions of the concrete heap accessible from static fields of test classes in order to find test cases that pollute the program state. While they only consider static fields of test classes for their state representation, in order to support test reorganization, we need to include, in addition to static fields, local variables and also member variables of the test class. Our state also includes information about the static types of variables and the production method calls that create a specific value in the test state, since for assertions we need to make sure that, after reorganizing, the assertion checks the output of the same function with the same inputs.

Test Suite Reduction. There is a large body of work on test suite reduction and test selection techniques [62]. Different techniques [15, 43, 50, 61] are proposed for removing redundant test cases. These techniques use different coverage criteria, such as statement coverage or branch coverage, to detect redundant test cases. Although it
Although itis possible to use coverage criteria with our approach, we chose to preserve the testsuite’s behaviour and find redundant parts of test cases that call the same production62method calls with exactly the same input. Smith et al. in [49] construct call trees ofthe program and identify a subset of test cases that cover the call tree paths.Test Suite Selection. Many techniques [20] are proposed for regression test selec-tion. These techniques use different levels of granularity for tracking dependencies,such as file dependency [25] and class dependency [47], to detect affected test casesas part of a change to production code. Gligoric et al. in [25] proposed a lightweight test selection technique based on dynamic dependencies of tests of files andintegrated their technique into JUnit testing framework. They showed that sincethey use a very light weight test selection technique the end-to-end testing time islower than the prior techniques.More recently, there are techniques [39, 48, 64] that combine test reduction,test selection and test prioritization techniques. Korel et al. in [39] combine testreduction with test selection techniques. They identify the deferences betweenoriginal EFSM model and the modified model as a set of elementary model modifi-cations and use EFSM model dependence analysis to reduce regression test suiteand remove the test cases that are redundant respect with the modifications thatare made to the model. Shi et al. [48] combine test reduction and test selectiontechnique to further reduce the number of tests executed for each commit.Test Comprehension and Visualization. Greiler et al. in [27] interviewed 25eclipse developers and incorporated the finding in creating five architectural viewsnamely, plug-in modularization view, extension initialization view, extension andservice usage view, and test suite modularization view to help developers in testsuite understanding for plug-in architectures.Greiler et al. 
in [28] compute similarities in test execution traces to detect similar high-level end-to-end tests and fine-grained unit tests. With this approach, they were able to restore traceability links between unit tests and requirements.

Kamimura et al. [38] generate human-oriented summaries for test cases. Their approach is based on static analysis of test cases' source code. They identify unique method invocations for each test case and find the verification statements related to these method invocations. Based on this information, they generate human-readable sentences describing the test case.

Cornelissen et al. [16] seek to gain an understanding of the software under test by analyzing unit tests. They perform dynamic analysis on test cases and construct a sequence diagram based on traces of test executions.

Chapter 5

Conclusion and Future Work

This thesis aims at improving test code quality by (1) characterizing bugs in test code and (2) reducing redundancies in test code. In the first part of the thesis, we present the first large-scale empirical study of bugs in test code to characterize their prevalence, impact, and root cause categories. We mine the bug repositories and version control systems of 448 Apache Software Foundation (ASF) projects, which are from a broad spectrum of domains, with various sizes and programming languages. We find 5,556 test bugs that were reported and fixed in the test code of these projects. The focus of our study was to gain insight into the different categories of test bugs and their root causes.
We (1) qualitatively studied a total of 443 randomly sampled test bug reports in detail and categorized them based on their impact, root cause, and fault location; (2) used our manually sampled data and applied machine learning techniques to automatically categorize the rest of the test bug reports based on their root cause; (3) compared properties of test bugs with production bugs, such as active time and the fixing effort needed; and (4) investigated whether FindBugs [6], a popular static bug detection tool for Java, is effective in detecting test bugs.

The results of our study show that (1) around half of all the projects analyzed had bugs in their test code that were reported and fixed; (2) the majority of test bugs are false alarms, i.e., the test fails while the production code is correct, while a minority of these bugs result in silent horrors, i.e., the test passes while the production code is incorrect; (3) incorrect and missing assertions are the dominant root cause of silent horror bugs; (4) semantic (25%), flaky (21%), and environment-related (18%) bugs are the dominant root cause categories of false alarms; (5) the majority of false alarm bugs happen in the exercise portion of the tests; (6) developers contribute more actively to fixing test bugs, and test bugs are fixed sooner compared to production bugs; and (7) FindBugs is not effective in detecting test bugs.

Our study has implications for both developers and researchers. The characterization of root causes of test bugs has practical implications for developers and testers on how to avoid these bugs. The results of our study indicate that test code is susceptible to bugs just like production code. However, current bug detection tools are mainly geared toward production code. Our results imply that there is a need to devise new bug detection tools for detecting bugs in test code.
We believe that the results of our study will be useful to researchers and developers in building new bug detection tools for detecting bugs in test code.

In the second part of the thesis, we focus on improving test code quality by reducing redundancies in test code. While current test reduction techniques operate at the test case level to detect and remove redundant test cases, we propose an approach for performing fine-grained analysis of test cases to perform test reduction at the test statement level. Our analysis is based on the inference of a test suite model that enables test case reorganization while preserving the behaviour of the test suite. We use our test suite model to reorganize test statements in the test cases in a way that removes the redundant test statements and reduces the redundancy. We implemented our approach in a tool called TESTMODLER and evaluated it on four open source projects. Our empirical results show that (1) 25% of all test cases in the test suites of our subject systems are partly redundant; (2) our technique can reduce the number of partly-redundant test cases by 63% and the number of redundant test statements by 49%; and (3) our technique can reduce the execution time of the test suites by 2.4% while preserving the test suite's behaviour.

5.1 Future Work

For future work, we plan to build on top of our empirical study and analyze correlations between test bugs and various software metrics. We plan to use the results of the empirical study to design a bug detection tool with test bug patterns capable of detecting bugs in test code. We also plan to extend our test model described in Chapter 3 and use coverage instead of production method calls (Definition 9) to detect redundancies. In this case, we might not actually preserve the test suite's behaviour, but we can preserve the test suite's coverage while reducing the test suite even more. This way, we can perform fine-grained test suite reduction while preserving the coverage of the test suite.
We also plan to investigate the use of test selection techniques on the reduced test suite. Since we only reorganize and merge test cases that have common test statements, it is likely that most of the reorganized test cases would be chosen together by a test selection technique. For example, test selection techniques that operate at the class level would likely choose all of the reorganized test cases together. Therefore, it might be possible to use test selection techniques on the reorganized test suite to further reduce the test suite.

Bibliography

[1] Additional tables and figures. → pages 7, 15, 30
[2] Apache Commons Collections. → pages 50
[3] Apache Commons Math. → pages 35, 50
[4] AssertJ. → pages 50
[5] CheckStyle. → pages 50
[6] FindBugs' rules. → pages 31, 65
[7] Apache projects' Git repositories. → pages 9
[8] JGit library. → pages 9
[9] Apache projects supported by Jira. → pages 7
[10] PMD rules. → pages 31
[11] TestModel: a test suite model to support test reorganizing. → pages 4, 48, 58
[12] D. Athanasiou, A. Nugroho, J. Visser, and A. Zaidman. Test code quality and its relation to issue handling performance. Transactions on Software Engineering, (11):1100–1125, 2014. → pages 61
[13] G. Bavota, A. Qusef, R. Oliveto, A. De Lucia, and D. Binkley. An empirical analysis of the distribution of unit test smells and their impact on software maintenance. In Proceedings of the International Conference on Software Maintenance (ICSM), pages 56–65, 2012. → pages 2, 60
[14] K. Beck. Test-driven development: by example. Addison-Wesley Professional, 2003. → pages 1
[15] J. Black, E. Melachrinoudis, and D. Kaeli. Bi-criteria models for all-uses test suite reduction. In Proceedings of the International Conference on Software Engineering (ICSE), pages 106–115, 2004. → pages 62
[16] B. Cornelissen, A. van Deursen, L. Moonen, and A. Zaidman. Visualizing test suites to aid in software understanding.
In Proceedings of the European Conference on Software Maintenance and Reengineering (CSMR), pages 213–222, 2007. → pages 64
[17] C. Couto, J. E. Montandon, C. Silva, and M. T. Valente. Static correspondence and correlation between field defects and warnings reported by a bug finding tool. Software Quality Control, pages 241–257, 2013. → pages 14, 15, 31
[18] W. Cunningham. Bugs in the tests. → pages 1, 10
[19] P. Devaki, S. Thummalapenta, N. Singhania, and S. Sinha. Efficient and flexible GUI test execution via test merging. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pages 34–44, 2013. → pages 62
[20] E. Engström, P. Runeson, and M. Skoglund. A systematic review on regression test selection techniques. Information and Software Technology, pages 14–30. → pages 63
[21] M. Erfani Joorabchi, M. Mirzaaghaei, and A. Mesbah. Works for me! Characterizing non-reproducible bug reports. In Proceedings of the Working Conference on Mining Software Repositories (MSR), pages 62–71. ACM, 2014. → pages 26
[22] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Montperrus. Fine-grained and accurate source code differencing. In Proceedings of the International Conference on Automated Software Engineering (ASE), pages 313–324. ACM, 2014. → pages 12
[23] Z. F. Fang and P. Lam. Identifying test refactoring candidates with assertion fingerprints. In Proceedings of the Principles and Practices of Programming on The Java Platform (PPPJ), pages 125–137. ACM, 2015. → pages 61
[24] G. Fraser and F. Wotawa. Redundancy based test-suite reduction. In Proceedings of the International Conference on Fundamental Approaches to Software Engineering (FASE), pages 291–305, 2007. → pages 62
[25] M. Gligoric, L. Eloussi, and D. Marinov. Practical regression test selection with dynamic file dependencies. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pages 211–222, 2015. → pages 63
[26] J. Goulding.
Be careful when using JUnit's expected exceptions. → pages 17
[27] M. Greiler and A. van Deursen. What your plug-in test suites really test: an integration perspective on test suite understanding. Empirical Software Engineering, 18:859–900, 2013. → pages 63
[28] M. Greiler, A. van Deursen, and A. Zaidman. Measuring test case similarity to support test suite understanding. In Proceedings of the International Conference on Objects, Models, Components, Patterns, pages 91–107. Springer-Verlag, 2012. → pages 63
[29] M. Greiler, A. van Deursen, and M.-A. Storey. Automated detection of test fixture strategies and smells. In Proceedings of the International Conference on Software Testing, Verification and Validation (ICST), pages 322–331, 2013. → pages 60
[30] E. Guerra and C. Fernandes. Refactoring test code safely. In Proceedings of the International Conference on Software Engineering Advances (ICSEA), pages 44–44, 2007. → pages 61
[31] A. Gyori, A. Shi, F. Hariri, and D. Marinov. Reliable testing: Detecting state-polluting tests to prevent test dependency. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pages 223–233, 2015. → pages 62
[32] D. Hao, T. Lan, H. Zhang, C. Guo, and L. Zhang. Is this a bug or an obsolete test? In Proceedings of the European Conference on Object-Oriented Programming (ECOOP), pages 602–628. Springer-Verlag, 2013. → pages 27
[33] B. Hauptmann, S. Eder, M. Junker, E. Juergens, and V. Woinke. Generating refactoring proposals to remove clones from automated system tests. In Proceedings of the International Conference on Program Comprehension, pages 115–124, 2015. → pages 2
[34] K. Herzig and N. Nagappan. Empirically detecting false test alarms using association rules. In Proceedings of the International Conference on Software Engineering (ICSE). IEEE, 2015. → pages 61
[35] D. Hovemeyer and W. Pugh. Finding bugs is easy. SIGPLAN Notices, 39(12):92–106, 2004. → pages 14, 60
[36] C. Jones. Programming Productivity.
McGraw-Hill, New York, NY, 1986. → pages 1
[37] P. Kainulainen. Three reasons why we should not use inheritance in our tests. → pages 58
[38] M. Kamimura and G. Murphy. Towards generating human-oriented summaries of unit test cases. In Proceedings of the International Conference on Program Comprehension (ICPC), pages 215–218, 2013. → pages 63
[39] B. Korel, L. H. Tahat, and B. Vaysburg. Model based regression test reduction using dependence analysis. In Proceedings of the International Conference on Software Maintenance (ICSM), pages 214–223, 2002. → pages 63
[40] Z. Li, L. Tan, X. Wang, S. Lu, Y. Zhou, and C. Zhai. Have things changed now?: An empirical study of bug characteristics in modern open source software. In Proceedings of the Workshop on Architectural and System Support for Improving Software Dependability, pages 25–33. ACM, 2006. → pages 60
[41] S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from mistakes: A comprehensive study on real world concurrency bug characteristics. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 329–339. ACM, 2008. → pages 60
[42] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov. An empirical analysis of flaky tests. In Proceedings of the International Symposium on Foundations of Software Engineering (FSE), pages 643–653. ACM, 2014. → pages 9, 26, 60
[43] M. Marre and A. Bertolino. Using spanning sets for coverage testing. IEEE Transactions on Software Engineering, pages 974–984, 2003. → pages 62
[44] S. McConnell. Code Complete. Microsoft Press, Redmond, WA, 1993. → pages 1
[45] D. Meyer. Organize your JIRA issues with subcomponents, 2013. → pages 8
[46] M. G. Nanda, M. Gupta, S. Sinha, S. Chandra, D. Schmidt, and P. Balachandran. Making defect-finding tools work for you. In Proceedings of the International Conference on Software Engineering (ICSE), pages 99–108. ACM, 2010. → pages 31
[47] A. Orso, N. Shi, and M. J. Harrold.
Scaling regression testing to large software systems. In Proceedings of the ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), pages 241–251, 2004. → pages 63
[48] A. Shi, T. Yung, A. Gyori, and D. Marinov. Comparing and combining test-suite reduction and regression test selection. In Proceedings of the Joint Meeting on Foundations of Software Engineering (FSE), pages 237–247, 2015. → pages 63
[49] A. M. Smith, J. Geiger, G. M. Kapfhammer, and M. L. Soffa. Test suite reduction and prioritization with call trees. In Proceedings of the International Conference on Automated Software Engineering (ASE), pages 539–540, 2007. → pages 63
[50] S. Tallam and N. Gupta. A concept analysis inspired greedy algorithm for test suite minimization. In Proceedings of the 6th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE '05), pages 35–42, 2005. → pages 62
[51] L. Tan, C. Liu, Z. Li, X. Wang, Y. Zhou, and C. Zhai. Bug characteristics in open source software. Empirical Software Engineering, 19(6):1665–1705, 2014. → pages 2
[52] A. Vahabzadeh, A. Milani Fard, and A. Mesbah. An empirical study of bugs in test code. In Proceedings of the International Conference on Software Maintenance and Evolution (ICSME), 10 pages. IEEE Computer Society, 2015. → pages iii, 5, 6, 57
[53] A. van Deursen, L. Moonen, A. v. d. Bergh, and G. Kok. Refactoring test code. In Extreme Programming Perspectives, pages 141–152. Addison-Wesley, 2002. → pages 57, 60
[54] B. van Rompaey, B. Du Bois, and S. Demeyer. Characterizing the relative significance of a test smell. In Proceedings of the International Conference on Software Maintenance (ICSM), pages 391–400, 2006. → pages 60
[55] B. van Rompaey, B. D. Bois, S. Demeyer, and M. Rieger. On the detection of test smells: A metrics-based approach for general fixture and eager test. IEEE Transactions on Software Engineering, 33(12):800–817, 2007. → pages 60
[56] R. J. Weiland.
The Programmer's Craft: Program Construction, Computer Architecture, and Data Management. Reston Publishing, Reston, VA, 1983. → pages 1

[57] W. Weimer and G. C. Necula. Finding and preventing run-time error handling mistakes. In Proceedings of the Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 419–431. ACM, 2004. → pages 32

[58] X. Xiao, S. Thummalapenta, and T. Xie. Advances on improving automation in developer testing. Advances in Computers, 85:165–212, 2012. → pages 1

[59] J. Xuan and M. Monperrus. Test case purification for improving fault localization. In Proceedings of the International Symposium on Foundations of Software Engineering, pages 52–63, 2014. → pages 61

[60] J. Xuan, B. Cornu, M. Martinez, B. Baudry, L. Seinturier, and M. Monperrus. B-refactoring: Automatic test code refactoring to improve dynamic analysis. Information and Software Technology, 76:65–80, 2016. → pages 61

[61] X.-y. Ma, Z.-f. He, B.-k. Sheng, and C.-q. Ye. A genetic algorithm for test-suite reduction. In Proceedings of the International Conference on Systems, Man and Cybernetics, pages 133–139, Vol. 1, 2005. → pages 62

[62] S. Yoo and M. Harman. Regression testing minimization, selection and prioritization: A survey. Software Testing, Verification and Reliability, 22(2):67–120, Mar. 2012. → pages 3, 62

[63] A. Zaidman, B. van Rompaey, S. Demeyer, and A. van Deursen. Mining software repositories to study co-evolution of production and test code. In Proceedings of the International Conference on Software Testing, Verification and Validation (ICST), pages 220–229, 2008. → pages 1, 27, 61

[64] L. Zhang, D. Marinov, and S. Khurshid. Faster mutation testing inspired by test prioritization and reduction. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pages 235–245. → pages 63

[65] S. Zhang, D. Jalali, J. Wuttke, K. Muşlu, W. Lam, M. D. Ernst, and D. Notkin. Empirically revisiting the test independence assumption.
In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), pages 385–396. ACM, 2014. → pages 26, 60

[66] Y. Zhang and A. Mesbah. Assertions are strongly correlated with test suite effectiveness. In Proceedings of the joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 214–224. ACM, 2015. → pages 3

