UBC Theses and Dissertations

A study of the influence of assertions and mutants on test suite effectiveness Zhang, Yucheng 2016


A Study of the Influence of Assertions and Mutants on Test Suite Effectiveness

by

Yucheng Zhang
B.Sc., The University of British Columbia, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

October 2016

© Yucheng Zhang 2016

Abstract

Test suite effectiveness is measured by assessing the portion of faults that can be detected by tests. To precisely measure a test suite's effectiveness, one needs to pay attention to both the tests and the set of faults used. Code coverage is a popular test adequacy criterion in practice. Code coverage, however, remains controversial, as there is a lack of coherent empirical evidence for its relation to test suite effectiveness. More recently, test suite size has been shown to be highly correlated with effectiveness. However, previous studies treat test methods as the smallest unit of interest and ignore potential factors influencing the correlation between test suite size and test suite effectiveness. We propose to go beyond test suite size by investigating the test assertions inside test methods. First, we empirically evaluate the relationship between a test suite's effectiveness and (1) the number of assertions, (2) assertion coverage, and (3) different types of assertions. We compose 6,700 test suites in total, using 24,000 assertions of five real-world Java projects. We find that the number of assertions in a test suite strongly correlates with its effectiveness, and that this factor positively influences the relationship between test suite size and effectiveness. Our results also indicate that assertion coverage is strongly correlated with effectiveness. Second, instead of focusing only on the testing side, we propose to investigate test suite effectiveness also by considering fault types (the ways faults are generated) and faults in different types of statements.
Measuring a test suite's effectiveness can be influenced by using faults with different characteristics. Assessing test suite effectiveness without paying attention to the distribution of faults is not precise. Our results indicate that the fault type and the type of statement where the fault is located can significantly influence a test suite's effectiveness.

Preface

This thesis presents two large-scale empirical studies on the influencing factors of test suite effectiveness, conducted by myself in collaboration with my supervisor, Professor Ali Mesbah. Chapter 3 has been published as a full conference paper at the joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015) [44]. The results described in Chapter 4 have been submitted to a software testing conference as a full paper. I was responsible for devising the experiments, running the experiments, analyzing the results, and writing the manuscript. My supervisor was responsible for guiding me in the creation of the idea and the experimental methodology, the design of the procedure for analyzing the experimental results, as well as the writing of Chapters 3 and 4.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Thesis Contribution
  1.2 Thesis Organization
2 Related Work
  2.1 Coverage Metrics
  2.2 Test Suite Size
  2.3 Assertion Coverage
  2.4 Test Characteristics
  2.5 Mutation Operators
3 Assertions Are Strongly Correlated with Test Suite Effectiveness
  3.1 Experimental Design
    3.1.1 Terminology
    3.1.2 Subject Programs
    3.1.3 Procedure
  3.2 Results
    3.2.1 Effectiveness of Assertion Quantity
    3.2.2 Effectiveness of Assertion Coverage
    3.2.3 Effectiveness of Assertion Types
4 Fault Type and Location Influence Measuring Test Suite Effectiveness
  4.1 Experimental Design
    4.1.1 Subject Programs
    4.1.2 Procedure
  4.2 Results
    4.2.1 Fault Type (RQ1)
    4.2.2 Fault Location (RQ2)
5 Discussion
  5.1 Test Suite Size vs. Assertion Quantity
  5.2 Implicit vs. Explicit Mutation Score
  5.3 Statement vs. Assertion Coverage
  5.4 Assertion Type
  5.5 Distribution of Mutants
  5.6 Mutation Selection
  5.7 Statement Type
  5.8 Threats to Validity
6 Conclusions and Future Work
Bibliography

List of Tables

3.1 Characteristics of the subject programs.
3.2 Number of assertions per test case.
3.3 Mutation data for the subject programs.
3.4 Correlation coefficients between test suite size and effectiveness (m), and assertion quantity and effectiveness (a). ρp shows Pearson correlations and ρk represents Kendall's correlations.
3.5 Correlations between number of assertions and suite effectiveness, when suite size is controlled for.
3.6 Statistics of test suites composed at different assertion coverage levels.
3.7 One-Way ANOVA on the effectiveness of assertion content types and actual assertion types.
3.8 Tukey's Honest Significance Test on the effectiveness of assertion content types and assertion method types. Each of the sample test suites used for the comparison contains 100 assertions of a target type.
3.9 One-Way ANOVA on the effectiveness of actual assertion types and assertion content types on JFreeChart.
4.1 Characteristics of the subject programs.
4.2 Default mutation operators provided by PIT.
4.3 One-Way ANOVA on the effectiveness of actual assertion types and assertion content types on JFreeChart.
4.4 One-Way ANOVA on the number of mutants killed. Each of the sample subsets of mutants used for the comparison contains 10 mutants generated by a specific mutation operator.
4.5 Tukey's Honest Significance Test on the number of mutants killed. Each of the sample subsets of mutants used for the comparison contains 10 mutants generated by a specific mutation operator.
4.6 Tukey's Honest Significance Test on the number of mutants killed between using all mutation operators and omitting each mutation operator in turn. Each of the sample subsets contains 100 mutants either generated by all mutation operators or leaving out each operator in turn.
4.7 Tukey's Honest Significance Test on the number of mutants killed. Each of the sample subsets of mutants used for the comparison contains 100 mutants.
4.8 Distribution of mutation location for each mutation operator.
4.9 Statistics of mutants at different nesting levels.
4.10 Statistics of explicitly detectable mutants at different nesting levels.

List of Figures

3.1 Plots of (a) suite size versus effectiveness, (b) assertion quantity versus effectiveness, and (c) suite size versus assertion quantity, for the 1000 randomly generated test suites from JFreeChart. The other four projects consistently share a similar pattern.
3.2 Plot of mutation score against suite size for test suites generated from assertion buckets low, middle, and high from JFreeChart. The other four projects consistently share a similar pattern.
3.3 Mutation score (above) and explicit mutation score (below) plotted against assertion coverage for the five subject programs. Each box represents the 50 test suites of a given assertion coverage that were generated from the original (master) test suite for each subject. The Kendall's correlation is 0.88–0.91 between assertion coverage and mutation score, and 0.80–0.90 between assertion coverage and explicit mutation score.
3.4 Plots of (a) mutation score against assertion coverage and statement coverage, (b) assertion coverage against assertion quantity, for the 1000 randomly generated test suites from JFreeChart.
3.5 Plots of (a) assertion quantity versus effectiveness of human-written and generated tests, (b) assertion content types versus effectiveness, and (c) assertion method types versus effectiveness. In (b) and (c), each box represents the 50 sample test suites generated for each type; the total number of assertions of each type is indicated in red.
3.6 Plots of (a) number of mutants killed versus assertion content types, (b) number of mutants killed versus assertion method types. Each box represents the 50 sample test suites generated for each assertion type.
4.1 Plots of the number of mutable locations, number of mutants generated, and number of mutants detected for all of the mutation operators.
4.2 Plots of the number of different types of statements, number of mutants generated by mutating different types of statements, and number of mutants detected from the generated mutants.
4.3 Plot of mutation score against the percentage of mutants generated in different types of statements. For each type of statement, a data point represents one of the four subject programs.
4.4 A box plot of the number of mutants killed at different mutation locations: return statements, condition statements, and normal statements. Each box represents the 100 subsets of mutants that were randomly selected from all mutants generated at the location for each subject. Each subset contains 100 mutants.

Acknowledgements

I would like to thank all the people who helped me successfully obtain my MASc degree.
First of all, I would like to thank my advisor, Professor Ali Mesbah, for his patient guidance, encouragement, and advice throughout my time as his student. I have been extremely lucky to have a supervisor who cared so much about my work, and who responded to my questions so promptly.

I would also like to thank my fellow colleagues in the SALT lab at UBC for the stimulating discussions, the critical feedback on my work, and all the fun we have had in the last two years. They made my entire experience enjoyable.

Last but not least, I would also like to thank my father Wen Zhang, my mother Huiping Dai, my uncle He Dai, my aunt Huihong Dai, my grandparents Ruying Ding and Houfa Dai, and all my relatives. I would not have had the opportunity to pursue my studies without their love and unwavering support.

Chapter 1
Introduction

Software testing has become an integral part of software development. A software product cannot be confidently released unless it is adequately tested. Code coverage is the most popular test adequacy criterion in practice. However, coverage alone is not the goal of software testing, since coverage without checking for correctness is meaningless. A more meaningful adequacy metric is the fault detection ability of a test suite, also known as test suite effectiveness.

Test suite effectiveness measures how well a test suite is capable of detecting faults. A common technique used to measure test suite effectiveness is mutation testing, in which small programmer errors are simulated to see if the test suite can kill the resulting mutants.
Naturally, the measurement depends on two components, namely, the test suite and the faults simulated.

There have been numerous studies analyzing the relationship between test suite size, code coverage, and test suite effectiveness [17, 18, 23–26, 33]. More recently, Inozemtseva and Holmes [28] found that there is a moderate to very strong correlation between the effectiveness of a test suite and the number of test methods, but only a low to moderate correlation between effectiveness and code coverage when the test suite size is controlled for. These findings imply that (1) the more test cases there are, the more effective a test suite becomes, (2) the more test cases there are, the higher the coverage, and thus (3) test suite size plays a prominent role in the observed correlation between coverage and effectiveness.

All these studies treat test methods as the smallest unit of interest. However, we believe such coarse-grained studies are not sufficient to reveal the main factors influencing a test suite's effectiveness. In this thesis, we propose to dissect test methods and investigate why test suite size correlates strongly with effectiveness. To that end, we focus on the test assertions inside test methods. Test assertions are statements in test methods through which desired specifications are checked against actual program behaviour. (We use the terms 'assertion' and 'test assertion' interchangeably in this thesis.) As such, assertions are at the core of test methods. We hypothesize that assertions have a strong influence on test suite effectiveness, and that this influence, in turn, is the underlying reason behind the strong correlation between test suite size, code coverage, and test suite effectiveness.

Additionally, measuring test suite effectiveness not only depends on the quality of the tests, but may also depend on which faults are seeded and where. Assessing test suite effectiveness without paying attention to the distribution of faults can be biased.
We hypothesize that seeded faults with different characteristics can influence the measurement of a test suite's effectiveness. In this thesis, we also measure test effectiveness against different types of faults and against faults located in different types of statements. To the best of our knowledge, we are the first to conduct a large-scale empirical study assessing the relationship between test suite effectiveness and fault type and location.

In this thesis, we use mutants to simulate real faults. Mutants have been widely adopted as substitutes for real faults in the literature [18, 27, 28, 33, 37, 39]. There is also empirical evidence that mutants are representative of real faults [14, 15, 19, 29]. We generate mutants using the seven default mutation operators provided by PIT [10].

1.1 Thesis Contribution

This thesis makes the following main contributions:

• The first large-scale study analyzing the relation between test assertions and test suite effectiveness. Our study composes 6,700 test suites in total, from 5,892 test cases and 24,701 assertions of five real-world Java projects of different sizes and domains.

• Empirical evidence that (1) test assertion quantity and assertion coverage are strongly correlated with a test suite's effectiveness, (2) assertion quantity can significantly influence the relationship between a test suite's size and its effectiveness, and (3) the correlation between statement coverage and effectiveness decreases dramatically when assertion coverage is controlled for.

• A classification and analysis of the effectiveness of assertions based on their properties, such as (1) creation strategy (human-written versus automatically generated), (2) the content type asserted on, and (3) the actual assertion method types.

• Empirical evidence that assertions classified by their types, such as the content type they assert on or the assertion method types, (1) might not be fine-grained enough to differentiate from each other, and (2) cannot significantly influence a test suite's
effectiveness.

• A quantitative analysis of the relationship between (1) seeded fault type and test suite effectiveness, and (2) seeded fault location (in different statement types) and test suite effectiveness.

• Empirical evidence that the fault type and the statement type where a fault is located can significantly influence a test suite's measured effectiveness.

1.2 Thesis Organization

This chapter serves to establish the overarching goal and motivation of this thesis. Chapter 2 discusses the related work. Chapter 3 describes in detail the experimental design and the results of our investigation of the influence of assertions on test suite effectiveness. Chapter 4 describes in detail the experiments we conducted to explore the influence of mutants on measuring test suite effectiveness. Chapter 5 discusses the findings from Chapters 3 and 4, and Chapter 6 concludes and presents future research directions.

An initial version of Chapter 3 was published as a full conference paper: Yucheng Zhang and Ali Mesbah. "Assertions Are Strongly Correlated with Test Suite Effectiveness". In Proceedings of the joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), 214-224, 2015 [44].

Chapter 2
Related Work

2.1 Coverage Metrics

There is a large body of empirical studies investigating the relationship between different coverage metrics (such as statement, branch, and MC/DC) and test suite effectiveness [17, 18, 23–26]. All these studies find some degree of correlation between coverage and effectiveness. However, coverage remains a controversial topic [18, 28], as there is no strong evidence for its direct relation with effectiveness. The reason is that coverage is necessary but not sufficient for a test suite to be effective. For instance, a test suite might achieve 100%
For instance, a test suite might achieve 100%coverage but be void of test assertions to actually check against the expectedbehaviour, and thus be ineffective.2.2 Test Suite SizeResearchers have also studied the relationship between test suite size, coverage,and effectiveness [28, 33]. In these papers, test suite size is measured in termsof the number of test methods in the suite. Different test suites, with sizecontrolled, are generated for a subject program to study the correlationbetween coverage and effectiveness. Namin and Andrews [33] report thatboth size and coverage independently influence test suite effectiveness. Morerecently, Inozemtseva and Holmes [28] find that size is strongly correlated witheffectiveness, but only a low to moderate correlation exists between coverageand effectiveness when size is controlled for. None of these studies, however,looks deeper into the test cases to understand why size has a profound impacton effectiveness. In our work, we investigate the role test assertions play ineffectiveness.42.3. Assertion Coverage2.3 Assertion CoverageSchuler and Zeller [38] propose the notion of ‘checked coverage’2 as a metricto assess test oracle quality. Inspired by this work, we measure assertioncoverage of a test suite as the percentage of statements directly covered bythe assertions. We are interested in assertion coverage because it is a metricdirectly related with the assertions in the test suite. In the original paper[38], the authors evaluated the metric by showing that there is a similartrend between checked coverage, statement coverage, and mutation score. Inthis thesis, we conduct an empirical study on the correlation level betweenassertion coverage and test suite effectiveness. In addition, we compose alarge set of test suites (up to thousands) for each subject under test, whereasonly seven test suites were compared in the original paper. 
Moreover, westudy the correlation between statement coverage and test suite effectiveness,to compare with the relationship between assertion coverage and test suiteeffectiveness, by composing test suites with assertion coverage controlled.2.4 Test CharacteristicsCai and Lyu [18] studied the relationship between code coverage and faultdetection capability under different testing characteristics. They found thatthe effect of code coverage on fault detection varies under different testingprofiles. Also, the correlation between the two measures is strong withexceptional test cases, while weak in normal testing settings. However, theydid not examine the role assertions might play in different profiles on theeffectiveness of test cases. To the best of our knowledge, we are the first toinvestigate the influence of different assertion properties on suite effectiveness.We classify assertion properties in three categories, and study the effectivenessof each classification separately.2.5 Mutation OperatorsResearchers have also sought to find sufficient subsets of mutation operatorsto reduce the computational cost of mutation testing. Offutt and Rothermelempirically compared the effectiveness of selective mutation with standardmutation in [34, 35]. They empirically evaluated [35] the mean loss inmutation score using 2-selective, 4-selective, and 6-selective mutation. They2We use the terms ‘assertion coverage’ and ‘checked coverage’ interchangeably in thisthesis.52.5. Mutation Operatorsfound [34] 5 out of 22 mutation operators, used by Mothra, suffice to efficientlyimplement mutation testing for achieving 99.5 percent mutation score. Wongand Mathur [42] implemented the idea of constraint mutation by using onlytwo mutation operators. From their evaluation, 80 percent of mutants arereduced with only 5 percent loss in mutation score. 
Mresa and Bottaci [30] took the cost of detecting equivalent mutants into consideration while evaluating selective mutation.

In addition, researchers have explored [31, 32, 40, 41] sufficient subsets of mutation operators for measuring test effectiveness. More recently, Deng et al. [22] and Delamaro et al. [20] evaluated the statement deletion mutation operator (SDL), and found that mutants generated by SDL require tests that are highly effective at killing other mutants, for Java and C programs, respectively. Researchers have also investigated guidelines for mutation reduction [13, 16, 21]. Zhang et al. [43] examined whether operator-based mutant-selection techniques are superior to random mutant selection. All of the studies mentioned compare groups of mutation operators in the context of mutation selection. They aim at reducing the number of mutants generated without a significant loss of test effectiveness. In this thesis, however, we compare individual mutation operators. We are mainly interested in whether using mutants generated by different mutation operators can significantly influence the measurement of test suite effectiveness. To the best of our knowledge, we are the first to conduct a large-scale empirical study on the influence of fault location on measuring test effectiveness.

Chapter 3
Assertions Are Strongly Correlated with Test Suite Effectiveness

Code coverage is a popular test adequacy criterion in practice. Code coverage, however, remains controversial, as there is a lack of coherent empirical evidence for its relation to test suite effectiveness. More recently, test suite size has been shown to be highly correlated with effectiveness. However, previous studies treat test methods as the smallest unit of interest, and ignore potential factors influencing this relationship. We propose to go beyond test suite size by investigating the test assertions inside test methods.
We empirically evaluate the relationship between a test suite's effectiveness and (1) the number of assertions, (2) assertion coverage, and (3) different types of assertions. We compose 6,700 test suites in total, using 24,000 assertions of five real-world Java projects. We find that the number of assertions in a test suite strongly correlates with its effectiveness, and that this factor directly influences the relationship between test suite size and effectiveness. Our results also indicate that assertion coverage is strongly correlated with effectiveness, and that different types of assertions can influence the effectiveness of their containing test suites.

This chapter was partially published as "Assertions Are Strongly Correlated with Test Suite Effectiveness" in the Proceedings of the joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE) [44].

3.1 Experimental Design

The goal of this chapter is to study the relationship between assertions and test suite effectiveness. To achieve this goal, we design controlled experiments to answer the following research questions:

RQ1 Is the number of assertions in a test suite correlated with effectiveness?
RQ2 Is the assertion coverage of a test suite correlated with effectiveness?

RQ3 Does the type of assertions in a test suite influence effectiveness?

We examine these three main aspects of assertions in our study because (1) almost all test cases contain assertions, but the number of assertions varies across test suites (see Table 3.2), and we aim to investigate whether the number of assertions plays a role in effectiveness; (2) the fraction of statements in the source code executed and checked directly by assertions should intuitively be closely related to effectiveness, and we set out to explore if, and to what degree, this is true; and (3) assertions have different characteristics that may potentially influence a test suite's effectiveness, such as their method of creation (e.g., human-written, automatically generated), the type of arguments they assert on (e.g., boolean, string, integer, object), and the assertion method itself (e.g., assertTrue, assertEquals).

All our experimental data is publicly available at http://salt.ece.ubc.ca/software/assertion-study/.

Table 3.1: Characteristics of the subject programs.

ID  Subject                          Java SLOC  Test SLOC  Test cases  Assertions  Statement cov.  Assertion cov.
1   JFreeChart [8]                   168,777    41,382     2,248       9,177       45%             30%
2   Apache Commons Lang [1]          69,742     41,301     2,614       13,099      92%             59%
3   Urban Airship Java Library [12]  35,105     11,516     503         1,051       72%             53%
4   lambdaj [9]                      19,446     4,872      310         741         93%             65%
5   Asterisk-Java [2]                36,530     4,243      217         633         24%             10%
    Total/Average                    329,600    103,314    5,892       24,701      45%             30%

3.1.1 Terminology

Test case: a JUnit4 test method annotated with @Test.
We use the terms 'test method' and 'test case' interchangeably in this chapter.

Test suite: the collection of a subject program's test cases.

Test suite size: the number of test cases in a test suite.

Master/original test suite: the test suite written by the developers of a subject program.

3.1.2 Subject Programs

To automate data collection, we selected Java programs that use Apache Maven (http://maven.apache.org) as their build system and JUnit4 as their testing framework. We select programs of different sizes to ensure that the experimental results are not dependent on project size.

Our set of subjects contains five projects in different application domains. JFreeChart [8] is a free Java chart library for producing charts. Apache Commons Lang [1] is a package of Java utility classes for the classes that are in java.lang's hierarchy. Urban Airship Java Library [12] is a Java client library for the Urban Airship API. Lambdaj [9] is a Java project for manipulating collections in a pseudo-functional and statically typed way. The last subject, Asterisk-Java [2], is a free Java library for Asterisk PBX integration.

The characteristics of these subject programs are summarized in Table 3.1. Lines of source code are measured using SLOCCount [11]. Columns 5–8 show, for each subject's master test suite, the test suite size in terms of the number of test methods, the assertion quantity, the statement coverage, and the assertion coverage, respectively. Table 3.2 presents descriptive statistics on the number of assertions per test case for the subject systems. (We were surprised to see such high maximum numbers of assertions per test case, so we manually verified these numbers. For instance, the 114 maximum assertions for JFreeChart are in the testEquals test method of the org.jfree.chart.plot.CategoryPlotTest class.)

3.1.3 Procedure

To study the relationship between assertions and test suite effectiveness, a large set of test suites with different assertion-related properties is required. In this section, we present how the experiments are conducted with respect to each research question. We first discuss the variables of interest, then
We first discuss the variables of interest, thenexplain how test data are collected by generating new test suites, and finallydescribe how the results are analyzed.4http://maven.apache.org5We were surprised to see such high max numbers of assertions per test case, so we manuallyverified these numbers. For instance, the 114 max assertions for JFreeChart are in the testEqualstest method of the org.jfree.chart.plot.CategoryPlotTest class.103.1. Experimental DesignTable 3.2: Number of assertions per test case.ID Min 1st Q. Median 3rd Q. Max Mean σ1 0 1 2 4 114 4.1 6.72 0 1 3 6 104 5.1 7.43 0 0 1 2 42 2.1 3.64 0 1 2 3 21 2.8 3.25 0 1 2 3 17 3.0 2.9Effectiveness of Assertion Quantity (RQ1)In order to answer RQ1, we investigate three variables, namely, number oftest methods, number of assertions, and test suite effectiveness. We collectdata by generating test suites in three ways, (1) randomly, (2) controllingtest suite size, and (3) controlling assertion quantity. For each set of testsuites, we compute the correlation between the three variables.Number of test cases We implemented a tool that uses the JavaParser[6] library to identify and count the total number of test cases in a given testsuite.Number of assertions For each identified test case, the tool counts thenumber of test assertions (e.g., assertTrue) inside the body of the test case.Test suite effectiveness Effectiveness captures the fault detection abilityof a test suite, which can be measured as a percentage of faults detectableby a test suite. To measure the fault detection ability of a test suite, alarge number of known real faults are required for each subject, which ispractically unachievable. Instead, researchers generate artificial faults thatresemble developer faults using techniques such as mutation testing. Inmutation testing, small syntactical changes are made to random locations inthe original program to generate a large number of mutants. The test suiteis then run against each mutant. 
A mutant is killed if any of the test case assertions fail or the program crashes.

Mutation score. The mutation score, calculated as the percentage of killed mutants over the total number of non-equivalent mutants, is used to estimate the fault detection ability of a test suite. Equivalent mutants are syntactically different but semantically identical to the original program, and thus undetectable by any test case. Since there is no trivial way of identifying equivalent mutants, similar to other studies [28], we treat all mutants that cannot be detected by a program's original (master) test suite as equivalent mutants when calculating mutation scores for our generated test suites.

Mutants are produced by transforming a program syntactically through mutation operators, and one could argue about the eligibility of using the mutation score to estimate a test suite's effectiveness. However, mutation testing is extensively used as a replacement for real fault detection ability in the literature [18, 28, 33]. There is also empirical evidence confirming the validity of mutation testing in estimating test suite effectiveness [14, 15, 19, 29].

We use the open source tool PIT [10] to generate mutants. We tested each of our subject programs to ensure their test suites can successfully execute against PIT. We use PIT's default mutation operators in all of our experiments.

Generating test suites. To answer RQ1, we generate test suites in three different ways, from the master test suites of the subject programs.

Random test suites. We first generate a set of test suites by randomly selecting a subset of the test cases in the master test suite, without replacement. The size of each generated test suite is also randomly decided. In other words, we generate this set of test suites without controlling for test suite size or assertion quantity.

Controlling the number of test methods. Each test case typically has one or more assertions.
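The mutation score computation just described, with mutants undetected by the master suite treated as equivalent, can be sketched as follows (a minimal illustration with made-up mutant IDs, not PIT output):

```python
def mutation_score(killed_by_suite, killed_by_master, all_mutants):
    """Fraction of non-equivalent mutants killed by a generated suite.

    Mutants the master suite cannot kill are treated as equivalent,
    mirroring the procedure described in the text."""
    non_equivalent = all_mutants & killed_by_master
    return len(killed_by_suite & non_equivalent) / len(non_equivalent)

all_mutants = {f"m{i}" for i in range(10)}      # 10 generated mutants
killed_by_master = {f"m{i}" for i in range(8)}  # m8, m9 treated as equivalent
killed_by_suite = {"m0", "m1", "m2", "m3"}      # a sampled suite kills 4 of them
print(mutation_score(killed_by_suite, killed_by_master, all_mutants))  # 0.5
```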
A test suite with more test cases is likely to contain more assertions, and vice versa. From our observations, if test suites are randomly generated, there exists a linear relationship between test suite size and the number of assertions in the suites. If there is a linear relationship between two properties A (e.g., assertion quantity) and B (e.g., suite size), a relationship between A and a third property C (e.g., effectiveness) can easily transform into a similar relationship between B and C through transitivity. To remove such indirect influences, we generate a second set of test suites by controlling the size. More specifically, a target test suite contains all of the test methods but only a subset of the assertions from the master test suite. Based on the total number of assertions in the master test suite, we first select a base number b, which indicates the size of the smallest test suite, and a step number x, which indicates the size difference between test suites. Therefore, the i-th test suite to be generated contains all of the test cases but only b + x*i randomly selected assertions of the master test suite.

Controlling the number of assertions. We also generate another set of test suites by controlling for assertion quantity. To achieve this, we first assign test cases to disjoint buckets according to the number of assertions they contain. For instance, for JFreeChart, test cases are assigned to three disjoint buckets, where bucket low contains test cases with 2 or fewer assertions, bucket middle contains test cases with 3 or 4 assertions, and bucket high contains the remaining test cases, which have 5 or more assertions. We divide test cases in this way such that each bucket has a comparable size. Then we generate 100 test suites from each of the buckets randomly without replacement.
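The bucketing scheme above can be sketched as follows, using the JFreeChart thresholds from the text (≤2, 3–4, ≥5); the test names and counts in the example are made up:

```python
import random

def assign_buckets(assertions_per_test):
    """Assign test cases to low/middle/high buckets by assertion count."""
    buckets = {"low": [], "middle": [], "high": []}
    for test, n in assertions_per_test.items():
        if n <= 2:
            buckets["low"].append(test)
        elif n <= 4:
            buckets["middle"].append(test)
        else:
            buckets["high"].append(test)
    return buckets

def sample_suite(bucket, size, rng):
    """One random test suite: `size` test cases drawn without replacement."""
    return rng.sample(bucket, size)

tests = {"t1": 1, "t2": 2, "t3": 3, "t4": 4, "t5": 5, "t6": 9}
buckets = assign_buckets(tests)
print(buckets["low"], buckets["middle"], buckets["high"])
suite = sample_suite(buckets["high"], 2, random.Random(0))
print(len(suite))  # 2
```

Repeating `sample_suite` 100 times per bucket with varying sizes yields the controlled sets described above.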
Following this process, with a similar test suite size, test suites generated from bucket high always contain more assertions than test suites generated from bucket middle, and so forth.

Correlation analysis. For RQ1, we use Pearson and Kendall's correlation to quantitatively study the relationship between test suite size, assertion quantity, and test suite effectiveness. The Pearson correlation coefficient indicates the strength of a linear relationship between two variables. The Kendall's correlation coefficient measures the extent to which, as one variable increases, the other variable tends to increase, without requiring that increase to be represented by a linear relationship.

Effectiveness of Assertion Coverage (RQ2)

To answer RQ2, we measure a test suite's assertion coverage, statement coverage, and effectiveness. We collect data by first looking at the set of test suites that were randomly generated for RQ1, and then generating a new set of test suites by controlling their assertion coverage. For each of the two sets of test suites, we study and compare the correlations between the three variables using the same analysis methods as described in Section 3.1.3.

Explicit mutation score. Not all detectable faults in a program are detected by test assertions. From our observations, mutants can either be explicitly killed by assertions or implicitly killed by program crashes. Programs may crash due to unexpected exceptions. Program crashes are much easier to detect as they do not require dedicated assertions in test cases. On the other hand, all the other types of faults, which do not cause an obvious program crash, are much more subtle and require proper test assertions for their detection. Since the focus of our study is on the role of assertions in effectiveness, in addition to the mutation score, we also compute the explicit mutation score, which measures the fraction of mutants that are explicitly killed by the assertions in a test suite.
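A minimal sketch of this metric, computed as described later in this section by re-measuring the suite with all assertions removed and subtracting the implicitly killed (crash) fraction; the scores below are hypothetical, not measured data:

```python
def explicit_mutation_score(full_score, assertionless_score):
    """Explicit mutation score via subtraction: mutants still killed after
    every assertion is stripped from the suite were killed implicitly
    (program crashes), so that fraction is removed from the full score."""
    return full_score - assertionless_score

full = 0.73           # mutation score of the complete suite (hypothetical)
assertionless = 0.12  # score of the same suite with all assertions removed
print(round(explicit_mutation_score(full, assertionless), 2))  # 0.61
```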
Table 3.3 provides mutation data in terms of the number of mutants generated for each subject, the number of mutants killed by the test suites, the number of mutants killed only by test assertions (i.e., excluding crashes), and the percentage of mutants killed by assertions with respect to the total number of killed mutants.

Table 3.3: Mutation data for the subject programs.

ID  Mutants  Killed (#)  Killed by Assertions (#)  Killed by Assertions (%)
1   34,635   11,299      7,510                     66%
2   11,632    9,952      7,271                     73%
3    4,638    2,546        701                     28%
4    1,340    1,084        377                     35%
5    4,775      957        625                     65%

From what we have observed in our experiments, PIT always generates the same set of mutants for a piece of source code when executed multiple times. Thus, to measure the explicit mutation score of a test suite, we remove all assertions from the test suite, measure its mutation score again, and then subtract the fraction of implicitly killed mutants from the original mutation score.

Assertion coverage. Assertion coverage, also called checked coverage [38], measures the fraction of statements in the source code executed via the backward slice of the assertion statements in a test suite.

We use the open source tool JavaSlicer [7] to identify assertion-checked statements, which are statements in the source code executed through the execution of assertions in a test suite. JavaSlicer is an open-source dynamic slicing tool, which can be used to produce traces of program executions and offline dynamic backward slices of the traces.
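Before detailing the slicing procedure, the two correlation measures used throughout our analyses (Section 3.1.3) can be illustrated with a small pure-Python sketch; the `kendall` below is the simple tau-a variant that ignores ties, a deliberate simplification of what statistical packages compute, and the data points are invented:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: strength of a *linear* relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def kendall(xs, ys):
    """Kendall's tau-a: normalized concordant-minus-discordant pairs,
    sensitive only to the direction of change, not its shape."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if prod > 0:
                concordant += 1
            elif prod < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Monotone but non-linear data: Kendall is exactly 1.0, Pearson is below 1.0.
suite_size = [10, 20, 30, 40, 50]
effectiveness = [0.30, 0.55, 0.70, 0.78, 0.82]
print(kendall(suite_size, effectiveness))  # 1.0
print(round(pearson(suite_size, effectiveness), 2))
```

This difference is why we report both coefficients: effectiveness tends to grow with size at a diminishing rate, which Kendall's measure captures without a linearity assumption.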
We automatically identify the checked statements of a test suite by (1) identifying all assertion statements and constructing slicing criteria, (2) using JavaSlicer to trace each test class separately, (3) mining the traces computed in the previous step to identify dynamic backward slices of the assertions, and finally (4) since each backward slice of an assertion includes statements from the test case, calls to the JUnit APIs, and statements from the source code, filtering the data to keep only the statements pertaining to the source code.

For large test suites, we observed that using JavaSlicer is very time consuming. Thus, we employ a method to speed up the process in our experiments. For all the test classes in each original master test suite, we repeat steps 1–4 above to compute the checked statements of each test method individually. Each statement in the source code is uniquely identified by its class name and line number, and assigned an ID. We then save information regarding the checked statements of each test method into a data repository. Once a new test suite is composed, its checked statements can be easily found by first identifying each test method in the test suite, then pulling the checked statements of the test method from the data repository, and finally taking the union of the checked statements. The assertion coverage of a generated test suite is thus calculated as the total number of checked statements of the suite divided by the total number of statements.

Statement coverage. Unlike assertion coverage, which only covers what assertion statements execute, statement coverage measures the fraction of the source code covered through the execution of the whole test suite.

In this chapter, we select statement coverage out of the traditional code coverage metrics as a baseline to compare with assertion coverage. The reason behind our selection is twofold.
First, statement coverage is one of the most frequently used code coverage metrics in practice, since it is relatively easy to compute and has proper tool support. Second, two recent empirical studies suggest that statement coverage is at least as good as any other code coverage metric in predicting effectiveness. Gopinath et al. [26] found that statement coverage predicts effectiveness best compared to block, branch, or path coverage. Meanwhile, Inozemtseva and Holmes [28] found that stronger forms of code coverage (such as decision coverage or modified condition coverage) do not provide greater insights into the effectiveness of the suite. We use Clover [3], a highly reliable industrial code coverage tool, to measure statement coverage.

Generating test suites. To answer RQ2, we first use the same set of test suites that were randomly generated for RQ1. In addition, we compose another set of test suites by controlling assertion coverage. We achieve this by controlling the number of checked statements in a test suite. Similarly, based on the total number of checked statements in the master test suite of a program, we predefine a base number b, which indicates the assertion coverage of the smallest test suite, and a step number x, which indicates the assertion coverage difference between the test suites. When generating a new test suite, a hash set of the currently checked statements is maintained. If the target number of checked statements has not been reached, a non-duplicate test method is randomly selected and added to the test suite. To avoid too many trials of random selection, this process is repeated until the test suite has between b + x*i and (b + x*i) + 10 checked statements. This way, the i-th target
test suite has an assertion coverage of (b + x*i)/N, where N is the total number of statements in the source code.

Effectiveness of Assertion Types (RQ3)

RQ3 explores the effectiveness of different characteristics of test assertions. To answer this research question, we first automatically assign assertions to different categories according to their characteristics, then generate a set of sample test suites for each category of assertions, and finally conduct statistical analysis on the data collected from the generated test suites.

Assertion categorization. Assertions can be classified according to their characteristics. Some of these characteristics may be potential factors influencing test suite effectiveness. We categorize assertions in three ways:

Human-written versus generated. Human-written test cases contain precise assertions written by developers about the expected program behaviour. On the other hand, automatically generated tests contain generic assertions. It is commonly believed that human-written test cases have a higher fault detection ability than generated assertions. We test this assumption in our work.

Assertion content type. Test assertions check either the value of a primitive data type or objects of different classes. We further classify Java's primitive data types into numbers (for int, byte, short, long, double, and float), strings (for char and String), and booleans. This way, depending on the type of the content of an assertion, it falls into one of the following classes: number-content-type, string-content-type, boolean-content-type, or object-content-type. We explore whether these assertion content types have an impact on the effectiveness of a test suite.

For assertion content type, we apply dynamic analysis to automatically classify the assertions in a given test suite into the different categories. We first instrument the test code to probe each assert statement for the type of content it asserts on.
Then, we run the instrumented test code, and use the collected information to automatically assign assertions to the different content type categories.

Assertion method type. It is also possible to categorize assertions according to their actual method types. For instance, assertTrue and assertFalse, assertEquals and assertNotEquals, and assertNull and assertNotNull can be assigned to different categories. We investigate whether these assertion method types have an impact on effectiveness.

For assertion method types, we parse the test code and syntactically identify and classify assertions into the different assertion method type classes.

Generating test suites. Under each assertion categorization, for each assertion type, we compose 50 sample test suites, each containing 100 assertions. A sample test suite contains all test methods in the master test suite, but only 100 randomly selected assertions of the target type. For instance, a sample test suite of the type string-content-type will contain all the test methods in the master test suite but only 100 randomly selected string-content-type assertions.

To quantitatively compare the effectiveness of human-written and generated assertions, for each subject program, we generate (1) 50 sample test suites, each containing 100 human-written assertions from the master test suite, and (2) 50 sample test suites, each containing 100 assertions automatically generated using Randoop [36], a well-known feedback-directed test case generator for Java. We use the default settings of Randoop.

Analysis of variances. For assertion content type and assertion method type, since there are multiple variables involved, we use the One-Way ANOVA (analysis of variance) statistical method to test whether there is a significant difference in test suite effectiveness between the variables. Before conducting the ANOVA test, we used the Shapiro-Wilk test to pretest the normality of our data, and Levene's test to pretest the homogeneity of their variances.
Both were positive. ANOVA answers the question of whether there are significant differences in the population means. However, it does not provide any information about how they differ. Therefore, we also conduct Tukey's Honest Significance Test to compare and rank the effectiveness of assertion types.

3.2 Results

In this section, we present the results of our experiments.

3.2.1 Effectiveness of Assertion Quantity

Ignoring test suite size. Figure 3.1 depicts plots of our collected data for JFreeChart. (We observed a similar trend for the other subjects, and only include plots for JFreeChart due to space limitations.)

[Figure 3.1: Plots of (a) suite size versus effectiveness, (b) assertion quantity versus effectiveness, and (c) suite size versus assertion quantity, for the 1000 randomly generated test suites from JFreeChart. The other four projects consistently share a similar pattern.]

Figures 3.1a and 3.1b show that the relationship between test suite size and effectiveness is very similar to the relationship between assertion quantity and effectiveness. As the plot in Figure 3.1c shows, there exists a linear relationship between the number of test methods and the number of assertions in the 1000 randomly generated test suites.

Table 3.4: Correlation coefficients between test suite size and effectiveness (m), and assertion quantity and effectiveness (a).
ρp shows Pearson correlations and ρk represents Kendall's correlations.

Subject ID  ρp(m)  ρp(a)  ρk(m)  ρk(a)  p-value
1           0.954  0.954  0.967  0.970  < 2.2e-16
2           0.973  0.973  0.969  0.969  < 2.2e-16
3           0.927  0.927  0.917  0.917  < 2.2e-16
4           0.929  0.928  0.912  0.930  < 2.2e-16
5           0.945  0.947  0.889  0.894  < 2.2e-16

Table 3.4 shows the Pearson (ρp) and Kendall's (ρk) correlations of effectiveness with respect to suite size (m) and assertion quantity (a), for the test suites that were randomly generated for all five subjects. As the table shows, there is a very strong correlation between the number of assertions in a test suite and the test suite's effectiveness, and the correlation coefficients are very close to those of suite size and effectiveness. This is consistent with the plots in Figure 3.1. The correlations between assertion quantity and effectiveness are slightly higher than or equal to the correlations between the number of test methods and effectiveness.

Finding 1: Our results indicate that, without controlling for test suite size, there is a very strong correlation between the effectiveness of a test suite and the number of assertions it contains.

Controlling for test suite size. Table 3.5 shows our results when we control for test suite size. Column 2 shows the number of assertions in the smallest test suite, and column 3 shows the difference in assertion quantity between generated test suites. Columns 4 and 5 present the Pearson and Kendall's correlations, respectively, between the assertion quantity and the effectiveness of the test suites that were generated from the five subjects by controlling test suite size. As the high correlation coefficients in this table indicate, even when test suite size is controlled for, there is a very strong correlation between effectiveness and the number of assertions.
Table 3.5: Correlations between number of assertions and suite effectiveness, when suite size is controlled for.

Subject ID  Base   Step  ρp(a)  ρk(a)  p-value
1           1,000  50    0.976  0.961  < 2.2e-16
2           100    100   0.929  0.970  < 2.2e-16
3           0      10    0.948  0.846  < 2.2e-16
4           100    10    0.962  0.839  < 2.2e-16
5           100    5     0.928  0.781  < 2.2e-16

Finding 2: Our results suggest that there is a very strong correlation between the effectiveness of a test suite and the number of assertions it contains, when the influence of test suite size is controlled for.

[Figure 3.2: Plot of mutation score against suite size for test suites generated from assertion buckets low, middle, and high from JFreeChart. The other four projects consistently share a similar pattern.]

Controlling for assertion quantity. Figure 3.2 plots the effectiveness of the test suites generated by controlling for the number of assertions. Three buckets, high, middle, and low, in terms of the number of assertions, were used for generating these test suites (see Section 3.1.3). From low to high, the buckets contained 762, 719, and 742 test cases in total, and the average number of assertions per test case was 0.9, 2.5, and 9.1, respectively. From the curves in the plot, we can see that effectiveness increases as the number of test methods increases. However, comparing the curves, there is a clear upward trend in a test suite's effectiveness as its assertion quantity level increases.
For every test suite on a lower curve in the plot, there exists a test suite of the same size that has a higher effectiveness because it contains more assertions.

Finding 3: Our results indicate that, for the same test suite size, assertion quantity can significantly influence effectiveness.

3.2.2 Effectiveness of Assertion Coverage

To answer RQ2, we first computed the assertion coverage, statement coverage, and mutation score of the randomly generated test suites (see Section 3.1.3). Figure 3.4a plots our results. The two fitted lines both have a very high adjusted R² and a p-value smaller than 2.2e-16; this indicates a very strong correlation between assertion coverage and effectiveness, as well as between statement coverage and effectiveness. The plot also shows that a test suite whose assertion coverage equals another test suite's statement coverage is much more effective in detecting faults. Compared with statement coverage, assertion coverage is a more sensitive predictor of test suite effectiveness.

Figure 3.4b plots assertion coverage against the number of assertions in a test suite. As the plot shows, the assertion coverage of a test suite increases as test suite size increases. However, the rate of increase in assertion coverage decreases as test suite size increases. There is a strong increasing linear relationship between assertion coverage and test suite effectiveness. Therefore, it is expected that test suite effectiveness increases with test suite size but at a diminishing rate, which is again consistent with our results in Section 3.2.1.

Finding 4: Our results suggest that assertion coverage is very strongly correlated with test suite effectiveness.
Also, ignoring the influence of assertion coverage, there is a strong correlation between statement coverage and effectiveness.

[Figure 3.3: Mutation score (above) and explicit mutation score (below) plotted against assertion coverage for the five subject programs (JFreeChart, Apache Commons Lang, Urban Airship Java Library, lambdaj, and Asterisk-Java). Each box represents the 50 test suites of a given assertion coverage that were generated from the original (master) test suite of each subject. The Kendall's correlation is 0.88–0.91 between assertion coverage and mutation score, and 0.80–0.90 between assertion coverage and explicit mutation score.]

Table 3.6: Statistics of test suites composed at different assertion coverage levels.
Subject ID  Assertion Cov.  Stat. Cov. Corr.       Mutation Score      Statement
                            ρgeneral   ρexplicit   general  explicit   Coverage
1           4.2%            0.62       0.17        0.21     0.13       15%
1           8.4%            0.60       0.22        0.35     0.24       23%
1           12.5%           0.59       0.11        0.48     0.35       30%
1           16.7%           0.63       0.33        0.61     0.45       36%
1           20.9%           0.58       0.26        0.73     0.61       41%
1           25.1%           0.71       0.32        0.85     0.75       47%
2           9.9%            0.67       0.49        0.17     0.12       21%
2           17.7%           0.67       0.51        0.30     0.22       34%
2           25.5%           0.65       0.48        0.43     0.31       46%
2           33.7%           0.66       0.50        0.56     0.40       57%
2           41.4%           0.58       0.27        0.69     0.50       68%
3           6.7%            0.62       0.06        0.18     0.05       19%
3           13.5%           0.74       0.06        0.33     0.10       30%
3           20.2%           0.76       0.01        0.46     0.14       39%
3           26.9%           0.75       0.07        0.59     0.17       48%
3           33.6%           0.76       0.05        0.70     0.20       55%
4           9.5%            0.76       0.21        0.17     0.06       19%
4           19.0%           0.73       0.33        0.31     0.10       32%
4           28.5%           0.70       0.30        0.46     0.15       45%
4           38.0%           0.63       0.23        0.61     0.20       58%
4           47.5%           0.50       0.10        0.76     0.26       70%
5           1.6%            0.73       0.63        0.10     0.06        4%
5           3.0%            0.76       0.35        0.23     0.15        8%
5           4.3%            0.70       0.38        0.41     0.28       12%
5           5.8%            0.60       0.25        0.57     0.43       16%
5           7.3%            0.62       0.24        0.71     0.56       19%

[Figure 3.4: Plots of (a) mutation score against assertion coverage and statement coverage (adjusted R² values 0.9836 and 0.9881), and (b) assertion coverage against assertion quantity, for the 1000 randomly generated test suites from JFreeChart.]

Controlling for assertion coverage. Figure 3.3 shows box plots of our results for the test suites generated by controlling their assertion coverage. The adjusted R² value for each regression line is shown in the bottom right corner of each box plot. It ranges from 0.94 to 0.99 between assertion coverage and mutation score, and from 0.80 to 0.99 between assertion coverage and explicit mutation score. This indicates that assertion coverage can predict both mutation score and explicit mutation score well.

Table 3.6 summarizes statistics for these test suites. Column 3 contains the Kendall's correlations between statement coverage and mutation score (0.50–0.76), and column 4 presents the Kendall's correlations between statement coverage and explicit mutation score (0.01–0.63).
When assertion coverage is controlled for, there is a moderate to strong correlation between statement coverage and mutation score, and only a low to moderate correlation between statement coverage and explicit mutation score. For instance, only about one third of the mutants generated for Urban Airship Java Library (ID 3) and lambdaj (ID 4) are explicitly detectable mutants; correspondingly, there is only a weak correlation (0.01–0.33) between their statement coverage and explicit mutation score. A higher fraction (about two thirds) of the mutants generated for the other three subjects are explicitly detectable mutants, and thus the correlation between their statement coverage and explicit mutation score increases significantly (from 0.11 to 0.63).

Columns 5–7 in Table 3.6 contain the average mutation score, average explicit mutation score, and average statement coverage of the test suites at each assertion coverage level, respectively. As the results show, a slight increase in assertion coverage can lead to an obvious increase in the mutation score and explicit mutation score. For instance, for JFreeChart (ID 1), when assertion coverage increases by around 4%, the mutation score increases by around 12.4% and the explicit mutation score increases by around 11%. On the other hand, a 4% increase in statement coverage does not always increase either mutation score or explicit mutation score. This again shows that assertion coverage is a more sensitive indicator of test suite effectiveness, compared to statement coverage.

Finding 5: Our results suggest that assertion coverage is capable of predicting both mutation score and explicit mutation score. With assertion coverage controlled for, there is only a moderate to strong correlation between statement coverage and mutation score, and a low to moderate correlation between statement coverage and explicit mutation score.
Test suite effectiveness is more sensitive to assertion coverage than to statement coverage.

3.2.3 Effectiveness of Assertion Types

Initial Study

To answer RQ3, we first examined the 9,177 assertions of JFreeChart.

Assertion generation strategy. Figure 3.5a plots the effectiveness of human-written test suites and Randoop-generated test suites against assertion quantity. As we can observe, the effectiveness of both human-written and generated test suites increases as the assertion quantity increases. However, the effectiveness of the generated test suites saturates much faster than that of the human-written test suites.

From our observations of the composed test suites, the 50 human-written sample test suites are effective in killing mutants, while the 50 generated test suites can hardly detect any mutant. We increased the assertion quantity in the sample test suites to 500, but still saw the same pattern.

Finding 6: Our results indicate that human-written test assertions are far more effective than automatically generated test assertions.

[Figure 3.5: Plots of (a) assertion quantity versus effectiveness of human-written and generated tests, (b) assertion content types versus effectiveness (Object: 1,307; Boolean: 4,940; Number: 1,754; String: 208 assertions), and (c) assertion method types versus effectiveness (assertTrue/False: 5,229; assertNull/Not: 167; assertEquals/Not: 3,481 assertions). In (b) and (c), each box represents the 50 sample test suites generated for each type; the total number of assertions of each type is indicated in red.]

Assertion content type. Assertions are also classified based on the types of the content they assert on. Figure 3.5b box-plots the effectiveness of the sample test suites that exclusively contain assertions on object, boolean, number, or string types. Tables 3.7 and 3.8 show the ANOVA and the Tukey's Honest Significance test, respectively. The F value is 1544 with a p-value
Tables 3.7 and 3.8 show the ANOVA and the Tukey’sHonest Significance test, respectively. The F value is 1544 with a p-value263.2. ResultsTable 3.7: One-Way ANOVA on the effectiveness of assertion con-tent types and actual assertion types.Df Sum Sq Mean Sq F value Pr(>F)Assertion Content TypesType 3 0.15675 0.05225 1544 <2e-16Residuals 196 0.00663 0.00003Assertion Method TypesType 2 0.01398 0.006988 87.87 <2e-16Residuals 147 0.01169 0.000080very close to 0, thus we can confidently reject the null hypothesis of equalvariances (effectiveness) for the four assertion content types. Table 3.8 showsthe estimated difference in mutation score in column 2, and the 95% confidenceinterval of the difference in columns 3 and 4. The Tukey’s test indicates thatthere is a significant difference between the effectiveness of assertions thatassert on boolean/object, string, and number types. Assertions that asserton boolean types are as effective as assertions that assert on objects.Assertion method type Assertions can also be classified by their actualmethod types. Figure 3.5c plots the effectiveness of the sample test suitesthat belong to the three assertion method types. In this chapter, we did notstudy assertSame and assertNotSame, because there were only 27 of them inJFreeChart, which is a low number to be representative. The bottom half oftables 3.7 and 3.8, present the ANOVA and Tukey’s Honest Significance test,respectively, for assertion method types. The F value is 87.87 with a p-valuevery close to 0, thus we can reject the null hypothesis of equal variances(effectiveness) for the three assertion method types. The Tukey’s test showsthat there exists a significant difference between the effectiveness of the threeassertion method types.Followup StudyOne year after we conducted the initial study, we conducted a followup studyon assertion content type and method type. 
We examined the 17,182 assertions of JFreeChart, Urban Airship Java Library, lambdaj, and Asterisk-Java, on their most up-to-date versions.

[Figure 3.6: Plots of (a) number of mutants killed versus assertion content types, and (b) number of mutants killed versus assertion method types, for JFreeChart, Java-library, Lambdaj, and Asterisk. Each box represents the 50 sample test suites generated for each assertion type.]

Table 3.8: Tukey's Honest Significance Test on the effectiveness of assertion content types and assertion method types. Each of the sample test suites used for the comparison contains 100 assertions of a target type.

Types                                diff     lwr      upr      p adj
Assertion Content Types
Boolean vs. Object                   -0.0002  -0.0032   0.0028  0.9985
Number vs. Boolean                   -0.0470  -0.0500  -0.0440  0.0000
String vs. Number                    -0.0156  -0.0186  -0.0126  0.0000
Assertion Method Types
assertNull/Not vs. assertTrue/False  -0.0103  -0.0145  -0.0060  1e-07
assertEquals/Not vs. assertNull/Not  -0.0133  -0.0175  -0.0091  0e+00

Assertion content type

Figure 3.6a box-plots the effectiveness of the sample test suites that exclusively contain assertions on object, boolean, number, or string types. Test suites with object-content-type assertions are the most effective in JFreeChart, Java-library, and Lambdaj, and the second most effective in Asterisk. Test suites with string-content-type assertions are the least effective in JFreeChart, but the most effective in Asterisk. The results for JFreeChart are different from the initial study.
Therefore, there is not a consistent pattern among the effectiveness of assertions that assert on different content types.

Table 3.9 shows the One-Way ANOVA test on the effectiveness of assertion content types for project JFreeChart. The F value of 0.128, with a p-value of 0.943 (close to 1), indicates that there is not a significant difference between the effectiveness of different content types of assertions. We obtain the same result from the tests conducted on the remaining three subjects.

Finding 7: There is neither a consistent ranking nor a statistically significant difference between assertion content types in terms of their test effectiveness.

Table 3.9: One-Way ANOVA on the effectiveness of assertion content types and assertion method types on JFreeChart.

                           Df    Sum Sq      Mean Sq   F value   Pr(>F)
Assertion Content Types
  Type                      3       18883      6294     0.128    0.943
  Residuals               196     9603200     48996
Assertion Method Types
  Type                      2         659       330     0.005    0.995
  Residuals               147    10721504     72935

Assertion method type

Figure 3.6b plots the effectiveness of the sample test suites that belong to the three assertion method types. Test suites with assertTrue/False are the most effective in JFreeChart, second most effective in Java-library and Lambdaj, but the least effective in Asterisk. Test suites with assertEquals/Not are the most effective in all of the projects except JFreeChart, where they are second most effective. Test suites with assertNull/Not are the least effective in all of the projects except Asterisk, where they are second least effective. The results for JFreeChart also differ from the initial study. Therefore, there is not a consistent ranking in the effectiveness of assertions of different method types.

Table 3.9 also shows the One-Way ANOVA test on the effectiveness of assertion method types for project JFreeChart. The F value of 0.005, with a p-value of 0.995 (close to 1), indicates that there is not a significant difference between the effectiveness of different method types of assertions.
We get a similar test result for the remaining three subjects.

Finding 8: There is neither a consistent ranking nor a statistically significant difference between assertion method types in terms of their test effectiveness.

Chapter 4

Fault Type and Location Influence Measuring Test Suite Effectiveness

Test suite effectiveness is measured by assessing the portion of faults that can be detected by tests. To precisely measure a test suite's effectiveness, one needs to pay attention to both the tests and the set of faults used to measure effectiveness. Instead of focusing only on the testing side, we propose to investigate test suite effectiveness also from the perspective of fault types (the ways faults are generated) and fault location. We empirically evaluate the relationship between test suite effectiveness, assertions, and faults based on 17,182 assertions and 18,820 artificial faults generated from four real-world Java projects. Our results indicate that the fault type and the type of statement where the fault is located can significantly influence a test suite's effectiveness. Assessing test suite effectiveness without paying attention to the type and distribution of faults can provide misleading results.

4.1 Experimental Design

Our goal in this study is to answer the following research questions through controlled experiments:

RQ1 Is measuring test effectiveness influenced by fault type?
RQ2 Is measuring test effectiveness influenced by fault location?

4.1.1 Subject Programs

We select four subject programs, which were also used in the previous chapter. We conduct our experiments on the latest versions of these subjects. The characteristics of the subject programs are summarized in Table 4.1. Column Version contains the version number and GitHub [5] commit id of the subjects. Lines of source code and test code are measured using SLOCCount [11].
The table also shows the total number of test assertions, and the number of mutants generated, for each subject program.

Table 4.1: Characteristics of the subject programs.

ID  Subjects                         Version                                    Java SLOC  Test SLOC  Assertions  Mutants
1   JFreeChart [8]                   jfreechart-1.0.19                          168,777    41,382     11,915      13,744
2   Urban Airship Java Library [12]  96cedd21fbe37ddbc555008c353c3a8736fda0e3   35,105     11,516     1,935       2,719
3   lambdaj [9]                      bd3afc7c084c3910454a793a872b0a76f92a43fd   19,446     4,872      2,674       1,308
4   Asterisk-Java [2]                5e9b16f2816cf5e6d6c6fa81e924ccdf3ead197f   36,530     4,243      658         1,049
    Total/Average                                                               329,600    103,314    17,182      18,820

Table 4.2: Default mutation operators provided by PIT.

ID    Mutation operator              Definition
CBM   Conditionals Boundary Mutator  replaces the relational operators <, <=, >, >=.
IM    Increments Mutator             mutates increments, decrements, and assignment increments and decrements of local variables (stack variables).
INM   Invert Negatives Mutator       inverts negation of integer and floating point numbers.
MM    Math Mutator                   replaces binary arithmetic operations for either integer or floating-point arithmetic with another operation.
NCM   Negate Conditionals Mutator    mutates all conditionals ==, !=, <=, >=, <, >.
RVM   Return Values Mutator          mutates the return values of method calls.
VMC   Void Method Calls Mutator      removes method calls to void methods.

4.1.2 Procedure

Fault Types (RQ1)

RQ1 explores how easily different types of faults can be detected by assertions. To answer this question, we first observe the distribution of mutants generated by different mutation operators. Then, we automatically sample subsets of mutants for each type, with size controlled for, and statistically compare the number of mutants in the subsets that are detectable by the original test suite.
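To make the operators in Table 4.2 concrete, the sketch below transliterates a Conditionals Boundary mutation to Python (PIT itself rewrites Java bytecode; the function names and threshold here are illustrative only). It also shows why such a mutant survives unless a test probes the boundary value:

```python
# Illustration of PIT's Conditionals Boundary Mutator (CBM), transliterated
# to Python. PIT mutates Java bytecode; names and threshold are ours.

def original(stock_level: int) -> bool:
    # Original predicate: reorder only when strictly below the threshold.
    return stock_level < 10

def cbm_mutant(stock_level: int) -> bool:
    # CBM replaces `<` with `<=`, shifting the boundary by one.
    return stock_level <= 10

def killed_by(inputs) -> bool:
    """The mutant is killed only if some input distinguishes it from the original."""
    return any(original(x) != cbm_mutant(x) for x in inputs)

print(killed_by([0, 25]))   # no boundary probe: the mutant survives
print(killed_by([0, 10]))   # probing the boundary value kills the mutant
```

This is why boundary mutants are commonly considered harder to kill: only inputs at the shifted boundary expose the difference.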
Finally, we remove each type of mutant separately, and examine whether doing so causes any significant difference in measuring test effectiveness.

Simulating fault types. We use mutants to simulate real faults. Mutants have been widely adopted as substitutes for real faults in the literature [18, 27, 28, 33, 37, 39]. There is also empirical evidence that mutants are representative of real faults [14, 15, 19, 29]. In mutation testing, mutation operators are used to simulate different types of programming errors. They define the rules by which mutants are constructed from the original program. Therefore, we assign mutants to different fault types according to their mutation operators. Again, we use PIT [10] to generate mutants. We use the seven default mutation operators provided by PIT, shown in Table 4.2.

Assertions vs. detectable mutants matrix. To speed up our experimentation, it is necessary to construct a matrix that maps assertions (and crashing statements) to their detected mutants, or vice versa. We noticed that (1) PIT stops executing the rest of the test cases after a test case first detects a mutant, (2) PIT stops executing the rest of the statements after a statement crashes the program, and (3) PIT does not provide a fine-grained mapping between the mutants generated and test cases, nor test assertions. In other words, PIT does not provide the functionality to create such a matrix. We extended PIT to construct this matrix. We first modified PIT so that it always executes all test cases, even after a test case fails. We then instrument the test suite by surrounding its test statements with JUnit's ErrorCollector [4]. The ErrorCollector allows the execution of a test to continue even after a test failure. This way, we record failure information during the execution of all tests against each mutant.

Subsets of mutants. Some mutation operators generate more mutants than others. Also, a mutation operator may generate more mutants for a different program.
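The matrix construction can be sketched in miniature. The harness below is hypothetical (ours, not PIT's API): each "assertion" is a predicate run against each program variant, and a failure is recorded rather than aborting the run, mimicking what JUnit's ErrorCollector enables for real test suites:

```python
# Hypothetical in-memory sketch of the assertion-vs-detectable-mutant matrix.
# Each "mutant" is a program variant; each "assertion" is a predicate over it.
# Failures are recorded and execution continues, in the spirit of ErrorCollector.

def build_kill_matrix(assertions, mutants):
    """matrix[a][m] is True iff assertion a fails on (i.e., detects) mutant m."""
    matrix = {}
    for a_name, check in assertions.items():
        matrix[a_name] = {}
        for m_name, program in mutants.items():
            try:
                detected = not check(program)   # a failing check detects the mutant
            except Exception:
                detected = True                 # a crash also counts as detection
            matrix[a_name][m_name] = detected   # record and keep going
    return matrix

# Toy variants: the original program and one negated-conditional mutant of abs().
mutants = {
    "original": lambda x: abs(x),
    "negate_conditional": lambda x: x if x < 0 else -x,
}
assertions = {
    "assertEquals_abs": lambda p: p(-3) == 3,
    "assertTrue_nonneg": lambda p: p(5) >= 0,
}
matrix = build_kill_matrix(assertions, mutants)
print(matrix["assertEquals_abs"]["negate_conditional"])
```

The full matrix over all assertions and mutants is what lets the later experiments recompute mutation scores for arbitrary sampled subsets without re-running the test suite.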
To quantitatively compare the mutants generated by different mutation operators, we randomly sample subsets of mutants generated by each mutation operator, controlling for mutant quantity. For each subject program, we randomly select 50 subsets of mutants generated by each mutation operator. Each subset contains 10 mutants of the target type, randomly selected without replacement. We picked 10 according to the smallest number of mutants generated by any of the mutation operators, so that each subset does not contain too many duplicate mutants.

Omitting mutation operators. In addition to comparing between mutants generated by each of the mutation operators, we compare mutants generated by all of the mutation operators against mutants generated when omitting each mutation operator, separately. To achieve this goal, we first sample 50 subsets of mutants of size 100 from the mutants generated by all mutation operators. Then, we leave out each of the mutation operators in turn, and sample 50 subsets of mutants of size 100 from the mutants generated by the remaining mutation operators. We evaluate whether there is any statistically significant influence on test effectiveness when a mutation operator is missing.

Fault Location: Types of Statement Mutated (RQ2)

RQ2 studies whether fault location influences the test effectiveness measurement. To answer this question, we examine whether the statement type where a program is mutated affects how easily a mutant can be detected. We start by observing the distribution of mutants in different types of statements. Then, we correlate the distribution with the mutation score of the subjects.

Types of statement. Mutation happens in different types of statements. To simplify our experimental design, we classify program statements as follows:

1. Conditional statements can change the control flow of the program and include if-else, for, and while constructs.
2. Return statements are statements with the return keyword.
3.
Statements are normal statements that are neither conditional statements nor return statements.

Mutants can also be located in nested statements. For example, a statement can be nested inside an if-else statement. Therefore, we also classify mutants based on their level of nesting.

Table 4.3: One-Way ANOVA on the effectiveness of assertion content types and assertion method types on JFreeChart.

                           Df    Sum Sq      Mean Sq   F value   Pr(>F)
Assertion Content Types
  Type                      3       18883      6294     0.128    0.943
  Residuals               196     9603200     48996
Assertion Method Types
  Type                      2         659       330     0.005    0.995
  Residuals               147    10721504     72935

Subsets of mutants. Different numbers of mutants can be generated in different types of statements. Thus, as we did for RQ1, we compose subsets of mutants in different types of statements, controlling for mutant quantity. For each subject program, we randomly sample 50 subsets of mutants generated by mutating the program in each statement type. Each subset contains 100 randomly selected mutants, without replacement. We pick 100 according to the total number of mutants generated in each type of statement for the subjects.

4.2 Results

4.2.1 Fault Type (RQ1)

Distribution of fault types

Figure 4.1 shows bar charts of the distribution of the number of mutable locations, the number of mutants generated, and the number of mutants detected by different mutation operators. Note that we did not include the mutation operator Invert Negatives Mutator in this figure, since it generates very few mutants in the subject programs. Each stacked bar as a whole illustrates the number of mutable locations of the mutation operator in a program's production code. It also shows the number of mutants actually generated and detected, separately. The correlation scores between the number of mutable locations and mutants generated are listed in the top left corner of the bar charts.
The figure shows that Return Values Mutator, Negate Conditionals Mutator, and Void Method Calls Mutator always generate more mutants than Conditionals Boundary Mutator, Increments Mutator, and Math Mutator, and that the distribution of mutants generated by the mutation operators aligns with the distribution of possible mutable locations of the operators. The correlation scores range between 0.6559 and 0.9801, which indicates that the number of mutants generated is strongly to very strongly correlated with the number of mutable locations.

Finding 9: Our results show that different mutation operators generate different numbers of mutants, which largely depends on the number of mutable locations present in the source code.

Controlling mutant type and quantity

To examine how easily mutants generated by different mutation operators can be detected, we control mutant type by sampling subsets of mutants generated by each mutation operator, one at a time. Each subset contains a fixed number of mutants. We observe the number of mutants in the subsets that can be detected by the original test suite. We performed one-way ANOVA tests of equal means on the number of detectable mutants between the six mutation operators. The F values were 466.9, 944.3, 830.4, and 135.5, respectively, with p-values close to 0. Thus, we can confidently reject the equal-means hypothesis and conclude that mutants generated by different mutation operators are not equally easy to detect.

We further compare the number of detectable mutants between the mutation operators pairwise by conducting a Tukey's Honest Significance Test. The results show some significant and consistent patterns within some of the fault type pairs. Table 4.5 illustrates these patterns. Column diff shows the estimated difference in the number of detectable mutants. Columns lwr and upr show the 95% confidence interval of the difference. The p-values are very close to zero, which indicates that the patterns we found are significant.
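The controlled sampling described above (fixed-size subsets per operator, drawn without replacement within each subset) can be sketched with Python's random.sample; the mutant identifiers below are synthetic:

```python
# Sketch of the controlled sampling for RQ1: for each mutation operator,
# draw 50 subsets of 10 mutants, each subset sampled without replacement.
# Mutant identifiers are synthetic placeholders, not real PIT mutants.
import random

def sample_subsets(mutants_by_operator, n_subsets=50, subset_size=10, seed=0):
    rng = random.Random(seed)   # fixed seed for reproducibility
    subsets = {}
    for operator, mutants in mutants_by_operator.items():
        # random.sample draws without replacement within one subset;
        # duplicates across subsets remain possible, as in the thesis.
        subsets[operator] = [rng.sample(mutants, subset_size)
                             for _ in range(n_subsets)]
    return subsets

mutants_by_operator = {op: [f"{op}_m{i}" for i in range(40)]
                       for op in ("CBM", "IM", "NCM", "RVM")}
subsets = sample_subsets(mutants_by_operator)
print(len(subsets["RVM"]), len(subsets["RVM"][0]))
```

Fixing the subset size is what makes the subsequent ANOVA a comparison of detectability rather than of operator productivity.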
From the table, we can conclude that mutants generated by Increments Mutator are easier to detect than mutants generated by Conditionals Boundary Mutator; Negate Conditionals Mutator is easier than Conditionals Boundary Mutator; Return Values Mutator is easier than Conditionals Boundary Mutator; Negate Conditionals Mutator is easier than Math Mutator; Math Mutator is harder than Increments Mutator; and Void Method Calls Mutator is harder than Return Values Mutator. All the other pairwise comparisons show either inconsistent or non-significant results.

Finding 10: Our results indicate that there is a significant difference in how easily mutants generated by different mutation operators can be detected.

Figure 4.1: Plots of number of mutable locations, number of mutants generated, and number of mutants detected by all of the mutation operators. The correlations between mutable locations and mutants generated are 0.9402 (JFreeChart), 0.9672 (Java-library), 0.9801 (Lambdaj), and 0.6559 (Asterisk).

Table 4.4: One-Way ANOVA on the number of mutants killed. Each of the sample subsets of mutants used for the comparison contains 10 mutants generated by a specific mutation operator.

              Df    Sum Sq   Mean Sq   F value   Pr(>F)
JFreeChart
  Operator      5   4207     841.4     466.9     <2e-16
  Residuals   594   1070     1.8
Java-library
  Operator      5   5549     1109.7    944.3     <2e-16
  Residuals   594   698      1.2
Asterisk
  Operator      5   4863     972.6     830.4     <2e-16
  Residuals   594   696      1.2
Lambdaj
  Operator      5   1103.7   220.74    135.5     <2e-16
  Residuals   594   967.6    1.63

Omitting mutation operators. We explored whether omitting any single mutation operator can lead to a significant difference in measuring test effectiveness. Our statistical analysis shows that omitting a single mutation operator does not always cause a significant difference in test effectiveness.
For example, without using Math Mutator, there exists a significant difference in test effectiveness for subject 1 (p-value equal to 0), but no significant difference in the remaining subjects. The same holds for Void Method Calls Mutator, Conditionals Boundary Mutator, and Negate Conditionals Mutator. However, there is always a significant loss in test effectiveness when omitting Return Values Mutator. Without using Increments Mutator and Invert Negatives Mutator, there is no significant change in test effectiveness.

Combining this finding with Figure 4.1, which illustrates the distribution of the number of mutants generated by the mutation operators, we can see that Increments Mutator and Invert Negatives Mutator always generate the fewest mutants in the subjects. The number of omitted mutants may not be large enough, compared to the total number of mutants generated, to influence the measured test suite effectiveness. Return Values Mutator generates the most mutants in Java-library, Lambdaj, and Asterisk, and the second most in JFreeChart; it always generates relatively more mutants than the other mutation operators. Therefore, omitting Return Values Mutator is likely to reveal the influence on measuring test effectiveness if there is any difference between mutants generated by Return Values Mutator and other mutation operators. We observed that omitting Return Values Mutator always causes a significant loss in test effectiveness, which potentially indicates that Return Values Mutator may generate easy-to-detect mutants. Intuitively, Return Values Mutator always mutates return statements.

Table 4.5: Tukey's Honest Significance Test on the number of mutants killed. Each of the sample subsets of mutants used for the comparison contains 10 mutants generated by a specific mutation operator.

Types      diff     lwr           upr           p adj
IM-CBM      4.74     4.197245      5.282755     0
            4.40     3.96167569    4.8383243    0
            6.35     5.9124172     6.78758276   0
            2.36     1.8439404     2.8760596    0
NCM-CBM     2.90     2.357245      3.442755     0
            4.63     4.19167569    5.0683243    0
            5.18     4.7424172     5.61758276   0
            2.32     1.8039404     2.8360596    0
RVM-CBM     4.94     4.397245      5.482755     0
            4.92     4.48167569    5.3583243    0
            4.66     4.2224172     5.09758276   0
            2.05     1.5339404     2.5660596    0
MM-IM      -6.49    -7.032755     -5.947245     0
           -7.55    -7.98832431   -7.1116757    0
           -1.77    -2.2075828    -1.33241724   0
           -3.43    -3.9460596    -2.9139404    0
NCM-MM      4.65     4.107245      5.192755     0
            7.78     7.34167569    8.2183243    0
            0.60     0.1624172     1.03758276   0.0015
            3.39     2.8739404     3.9060596    0
VMC-RVM    -5.71    -6.252755     -5.167245     0
           -0.60    -1.03832431   -0.1616757    0.0014
           -2.09    -2.6060596    -1.5739404    0

Finding 11: Omitting certain mutation operators, such as Return Values Mutator, always causes a significant loss in the measured test effectiveness.

This finding gives us the insight that there may be a difference in measuring test effectiveness when mutations are located in different types of statements.

4.2.2 Fault Location (RQ2)

Table 4.6: Tukey's Honest Significance Test on the number of mutants killed between using all mutation operators and omitting each mutation operator in turn.
Each of the sample subsets contains 100 mutants, either generated by all mutation operators or leaving out each operator in turn.

Subject    diff      lwr            upr           p adj
Without ReturnValsMutator
  1        -8.22     -9.4971        -6.9429       0
  2        -3.64     -4.832444      -2.447556     0
  3        -3.89     -5.051185      -2.728815     0
  4        -4.56     -5.881209      -3.238791     0
Without MathMutator
  1         4.25      2.985603       5.514397     0
  2         1.23      0.05409759     2.405902     0.0404424
  3        -0.05     -1.25359        1.15359      0.9347913
  4        -0.67     -1.932948       0.5929484    0.2967606
Without VoidMethodCallMutator
  1         8.15      6.959959       9.340041     0
  2         0.82     -0.3725293      2.012529     0.1766477
  3         1.52      0.3547874      2.685213     0.0108299
  4        13.3      12.18597       14.41403      0
Without ConditionalsBoundaryMutator
  1         0.96     -0.2902678      2.210268     0.1315736
  2         0.97     -0.1547487      2.094749     0.0905699
  3         0.05     -1.137623       1.237623     0.9339165
  4         2.22      1.061811       3.378189     0.0002076
Without IncrementsMutator
  1        -0.88     -2.200685       0.4406852    0.1903678
  2         0.41      0.7976888      1.617689     0.5039673
  3        -1.06     -2.170466       0.0504664    0.0612489
  4        -0.24     -1.386161       0.9061614    0.6801049
Without NegateConditionalsMutator
  1        -7.35     -8.735988      -5.964012     0
  2         0.95     -0.2667233      2.166723     0.1252245
  3       -13.37    -14.60836      -12.13164      0
  4        -3.01     -4.160555      -1.859445     6e-07
Without InvertNegsMutator
  1        -0.69     -1.957531       0.5775308    0.2843542
  2         0.59     -0.5303536      1.710354     0.3003026
  3        -0.35     -1.443749       0.7437485    0.5287379
  4        -0.45     -1.690054       0.7900544    0.4750694

Figure 4.2: Plots of number of different types of statements, number of mutants generated by mutating different types of statements, and number of mutants detected from the generated mutants. The overall mutation scores are MS = 0.456 (JFreeChart), 0.759 (Java-library), 0.776 (Lambdaj), and 0.700 (Asterisk).
Distribution of mutants in different types of statements

Figure 4.2 summarizes the statistics of mutants located in different types of statements in the source code. In each bar chart, a group of three bars, from left to right, represents (1) the number of statements of the type, (2) the number of mutants located in that type of statement, and (3) the number of detected mutants from (2). The top left corner of each bar chart lists the overall mutation score of the subject program.

The distribution of the number of different types of statements varies between subjects. For example, there are fewer return statements than conditional statements in JFreeChart, but more return statements than conditional statements in the remaining three subjects. The distribution of mutants located in different types of statements is also subject dependent. However, a larger portion of return and conditional statements is always mutated compared to normal statements.

In addition, the number of mutants generated in conditional statements is larger than the number of conditional statements in Java-library and Lambdaj. The number of mutants generated in return statements is larger than the number of return statements in Lambdaj as well. This observation indicates that a statement in a program can be mutated more than once and in different ways.

Finding 12: Normal statements are less likely to be mutated compared to return statements and conditional statements. A program statement may be mutated more than once in different ways.

Compare across subjects

In Figure 4.3, for each type of statement, the mutation score is plotted against the percentage of mutants located in that type of statement, for each subject program separately. The distribution of mutants in different types of statements has a strong to very strong correlation with the mutation score of the subjects.
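The correlations underlying Figure 4.3 are Pearson's r, which can be computed from first principles with the standard library; the vectors below are synthetic, not the four subjects' actual values:

```python
# Pearson's r from first principles (stdlib only), as used for the
# mutant-ratio vs. mutation-score correlations. Synthetic data only.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# A perfectly increasing relationship gives r = 1; a decreasing one gives r < 0.
r_up = pearson_r([10, 20, 30, 40], [0.4, 0.5, 0.6, 0.7])
r_down = pearson_r([10, 20, 30, 40], [0.9, 0.7, 0.6, 0.2])
print(round(r_up, 4), round(r_down, 4))
```

With only four subjects per statement type, each r in Figure 4.3 is computed from four points, so these correlations should be read as indicative rather than conclusive.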
The correlation between the mutation score and the ratio of mutants is very strong when a program is mutated in return statements (0.86) and normal statements (-0.87), and strong when it is mutated in conditional statements (-0.62). A negative correlation indicates a relationship in which one variable increases as the other decreases, and vice versa. For instance, the mutation score will increase as the ratio of mutants in normal and conditional statements decreases, and vice versa.

Figure 4.3: Plot of mutation score against percentage of mutants generated in different types of statements. For each type of statement, a data point represents one of the four subject programs.

Finding 13: The percentage of mutants generated by mutating different types of statements is strongly to very strongly correlated with test suite effectiveness.

Compare within subjects

We compared the number of detectable mutants in the subsets, controlling for statement type and mutant quantity. Figure 4.4 shows box-plots of our results. From the plots, we can observe a clear decreasing trend in the number of mutants detectable by the original test suite located in return statements, conditional statements, and normal statements, respectively. This result is consistent with our findings in Section 4.2.1 when comparing between subjects.

Figure 4.4: A box-plot of number of mutants killed at different mutation locations: return statements, condition statements, and normal statements. Each box represents the 100 subsets of mutants that were randomly selected from all mutants generated at the location for each subject. Each subset contains 100 mutants.

We performed ANOVA tests of equal means for the number of detectable mutants located in different types of statements. The F values were 4659.6, 241.35, 7936.5, and 195.07, respectively, for the subjects, with p-values very close to 0. Thus we can confidently reject the null hypothesis of equal means (of the number of detectable mutants) for the four subjects. Next, we applied Tukey's Honest Significance test to compare mutants located in different types of statements pairwise. The test estimated the pairwise difference in the number of killed mutants between different types of statements. The test result is consistent with what we observed in the box-plot in Figure 4.4.

Finding 14: Our results indicate that the type of statement mutated significantly influences test suite effectiveness when comparing within each subject. Mutants in return statements are the easiest to kill. Mutants in condition statements are easier to kill than those in normal statements.

Table 4.7: Tukey's Honest Significance Test on the number of mutants killed.
Each of the sample subsets of mutants used for the comparison contains 100 mutants.

Types                   diff      lwr            upr           p adj
JFreeChart
  Return-Condition      20.80     19.38946       22.21054      0
  Statement-Condition  -36.31    -37.72054      -34.89946      0
  Statement-Return     -57.11    -58.52054      -55.69946      0
Java-library
  Return-Condition       4.69      3.427404       5.952596     0
  Statement-Condition   -7.01     -8.272596      -5.747404     0
  Statement-Return     -11.70    -12.962596     -10.437404     0
Asterisk
  Return-Condition       1.65      0.5488635      2.751137     0.0014
  Statement-Condition  -50.16    -51.2611365    -49.058863     0
  Statement-Return     -51.81    -52.9111365    -50.708863     0
Lambdaj
  Return-Condition       7.76      6.552742       8.9672583    0
  Statement-Condition   -1.75     -2.957258      -0.5427417    0.0021
  Statement-Return      -9.51    -10.717258      -8.3027417    0

Table 4.8: Distribution of mutation location for each mutation operator.

Java-Library                   Con    %     Ret    %     Smt    %
  ReturnValsMutator             39    3     1143   97       0    0
  VoidMethodCallMutator         26    5.2      0    0     475   94.8
  NegateConditionalsMutator    763   80.7     53    5.6   129   13.7
  MathMutator                    0    0        2    5.3    36   94.7
  ConditionalsBoundaryMutator   30   83.3      3    8.3     3    8.3
  IncrementsMutator             14   82.4      0    0       3   17.6
  InvertNegsMutator              0    0        0    0       0    0
Asterisk
  ReturnValsMutator              2    0.8    259   99.2     0    0
  VoidMethodCallMutator          2    0.9      0    0     231   99.1
  NegateConditionalsMutator    402   93.5     13    3.0    15    3.5
  MathMutator                   14   31.1      4    8.9    27   60.0
  ConditionalsBoundaryMutator   61   96.8      1    1.6     1    1.6
  IncrementsMutator             16   94.1      0    0       1    5.9
  InvertNegsMutator              0    0        0    0       0    0
JFreeChart
  ReturnValsMutator             18    0.7   2600   98.7    17    0.65
  VoidMethodCallMutator        171    4.8      0    0    3395   95.2
  NegateConditionalsMutator   4818   86.1     26    0.5   167    3.3
  MathMutator                   54    3.1    116    6.7  1569   90.2
  ConditionalsBoundaryMutator  517   93.3      7    1.3    30    5.4
  IncrementsMutator            149   86.6      0    0      23   13.4
  InvertNegsMutator              5   11.1      7   15.6    33   73.3
Lambdaj
  ReturnValsMutator            119   17.7    555   82.3     0    0
  VoidMethodCallMutator         11    8.7      0    0     116   91.3
  NegateConditionalsMutator    293   72.5     81   20.0    30    7.4
  MathMutator                   14   38.9     10   27.8    12   33.3
  ConditionalsBoundaryMutator   29   85.3      5   14.7     0    0
  IncrementsMutator             25   80.6      0    0       6   19.4
  InvertNegsMutator              0    0        2  100       0    0

Statement type vs. mutation operator. Table 4.8 summarizes the number of mutants generated in different types of statements by each mutation operator. The first column indicates which mutation operator is responsible for generating the mutants.
Column Con contains the number of mutants generated in conditional statements, Ret shows the number generated in return statements, and Smt presents the number generated in normal statements. Each column % represents the ratio of mutants generated by the operator in the type of statement stated in the previous column. For example, for subject JFreeChart, 18 mutants (0.7%) are generated in conditional statements, 2600 (98.7%) in return statements, and 17 (0.65%) in normal statements by the Return Values Mutator operator.

From the table, we can observe that most of the mutants generated by Return Values Mutator are in a return statement (82.3–99.2%). Mutants generated by Void Method Calls Mutator are always in normal statements (94.8–99.1%). Negate Conditionals Mutator, Conditionals Boundary Mutator, and Increments Mutator usually mutate conditional statements (81–93.5%). However, Math Mutator and Invert Negatives Mutator do not show a consistent pattern across projects. Therefore, five out of the seven mutators studied in this thesis are able to influence the distribution of mutants generated in different types of statements.

Finding 15: Our results indicate that 5/7 of the mutation operators studied correlate with the type of statement mutated.

Table 4.9: Statistics of mutants at different nesting levels.

         JFreeChart                    Java-Library                Lambdaj                   Asterisk
Levels   Nstmt  Mtotal Mkilled MS      Nstmt Mtotal Mkilled MS     Nstmt Mtotal Mkilled MS   Nstmt Mtotal Mkilled MS
1        24810  8894   4281    48.1    3892  1384   1101    79.6   811   698    545    78.1  6465  739    486    65.8
2        10970  3279   1627    49.6    979   878    582     66.3   318   180    136    75.6  1168  217    169    77.9
3        3391   979    286     29.2    175   121    106     87.6   112   27     16     59.3  317   59     48     81.4
>=4      2121   569    75      13.2    43    335    273     81.5   118   28     20     71.4  142   34     31     91.2

Table 4.10: Statistics of explicitly detectable mutants at different nesting levels.

         JFreeChart             Lambdaj               Java-Library          Asterisk
Levels   killed explicit %      killed explicit %     killed explicit %     killed explicit %
1        4281   2814    65.7    355    197     55.5   1101   641     58.2   486    382     78.6
2        1627   1337    82.2    169    71      42.0   582    168     28.9   169    140     82.8
3        286    209     73.1    389    247     63.5   106    42      39.6   48     40      83.3
>=4      75     40      53.3    102    49      48.0   273    89      32.6   31     22      71.0

Nesting levels. Table 4.9 shows the statistics of statements and mutants at different nesting levels. Column Nstmt shows the total number of statements at each nesting level, column Mtotal indicates the total number of mutants generated at each level, column Mkilled shows the number of detectable mutants at that level, and column MS gives the mutation score of the original test suite when only the mutants generated at that level are considered. The mutation score decreases from 48.1% to 13.2% as the nesting level increases for JFreeChart, but increases from 65.8% to 91.2% in Asterisk. There is no clear increasing or decreasing trend in mutation score as the nesting level goes up in Java-library and Lambdaj. Therefore, we did not observe any correlation between the nesting level of a mutant and how easily it can be detected. Table 4.10 summarizes the distribution of mutants explicitly detected by assertions out of all detected mutants. There is no correlation between the level of nesting and the ratio of explicitly detected mutants. This might be due to the presence of dedicated test assertions that testers write for nested statements. There is thus no evidence that testers pay less attention to more deeply nested statements.

Finding 16: Our results indicate that there is no correlation between how easily a mutant can be killed and the depth of nesting of the mutated statement.

Chapter 5

Discussion

5.1 Test Suite Size vs. Assertion Quantity

From findings 1 and 2, the number of assertions in a test suite is very strongly correlated with its effectiveness, with or without controlling for the influence of test suite size. However, according to finding 3, if we in turn control for the number of assertions, a test suite's effectiveness at the same test suite size can be directly influenced by the number of assertions it contains.
Thus, test suite size is not sufficient for predicting effectiveness without considering the influence of assertion quantity. In addition, assertion quantity provides extra indications about the suite's explicit mutation score, which constitutes a large portion of the mutation score. Therefore, test suite size can predict effectiveness only under the assumption that there is a linear relationship between the number of test methods and the number of assertions in the test suite. We believe this is an interesting finding, which explains why previous studies [28] have found a strong correlation between suite size and effectiveness.

5.2 Implicit vs. Explicit Mutation Score

We noticed an interesting phenomenon, namely, that mutants that are implicitly detectable can also be detected by assertions, if the mutated statement falls in the coverage of the assertion. However, mutants that are explicitly detectable by assertions can never be detected by the non-assertion statements of the tests. This is because explicitly detectable mutants cannot be detected by simply executing the mutated part of a program; i.e., a specific assertion statement is required to catch the program's unexpected behaviour. This is due to the fact that explicitly detectable mutants inject logical faults into a program that lead to a contradiction with the programmers' expectations. From our observations, a substantial portion of all detectable mutants (28%–73%, depending on the subject) are explicitly detected by assertions in a test suite; therefore, assertions strongly influence test suite effectiveness. If we only focus on explicitly detectable mutants, then test assertions are the only means to achieve suite effectiveness. This might also explain why statement coverage achieves a relatively low correlation with explicit mutation score.

5.3 Statement vs. Assertion Coverage

From findings 4 and 5, assertion coverage is a good estimator of both mutation score and explicit mutation score.
When the influence of assertion coverage is controlled for, there is only a moderate correlation between statement coverage and mutation score, and a weak correlation between statement coverage and explicit mutation score. Therefore, statement coverage is a valid estimator of mutation score only under the assumption that not all generated mutants are explicitly detectable; in other words, statement coverage is not an adequate metric of logical-fault detection ability. Statement coverage includes more source-code statements than assertion coverage, without providing any extra insight for predicting test suite effectiveness. Compared with statement coverage, assertion coverage is very strongly correlated with effectiveness regardless of the distribution of implicitly and explicitly detectable mutants. Our results suggest that, when trying to improve a test suite's effectiveness, testers should aim at increasing the suite's assertion coverage rather than its statement coverage.

5.4 Assertion Type

Findings 7 and 8 indicate that there is neither a consistent ranking nor a significant difference between the effectiveness of different types of assertions. We manually assessed the assertions in the sample test suites and found that assertions of one type can easily be interpreted as another type. For example, assertEquals/assertNotEquals can be interpreted as assertTrue/assertFalse: assertEquals(A, B) can also be written as assertTrue(A.equals(B)), and assertNotEquals(A, B) as assertFalse(A.equals(B)). For assertEquals/assertNotEquals(A, B), the assertion content type is the type of A and B, whereas the assertion content type of assertTrue/assertFalse(A.equals(B)) is boolean, while the effectiveness of the assertion stays the same. assertNull/assertNotNull can be interpreted as assertTrue/assertFalse as well: assertNull(A) can be written as assertTrue(A == null), where the assertion content type changes from the type of A to boolean.
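The rewrites described above can be sketched in plain Java; JUnit itself is not required for the sketch, so each assertion form is modelled as a boolean predicate, and all names (equalsForm, trueForm, nullForm) are illustrative.

```java
// Hypothetical sketch: the same check expressed as different JUnit assertion forms.
public class AssertionEquivalence {

    // Stands in for assertEquals(a, b): content type is String.
    static boolean equalsForm(String a, String b) {
        return a.equals(b);
    }

    // Stands in for assertTrue(a.equals(b)): content type is boolean,
    // but the underlying predicate is identical.
    static boolean trueForm(String a, String b) {
        boolean contents = a.equals(b);
        return contents;
    }

    // Stands in for assertNull(a) rewritten as assertTrue(a == null).
    static boolean nullForm(Object a) {
        return a == null;
    }

    // The two forms agree on every input, so they kill the same mutants.
    static boolean formsAgree(String a, String b) {
        return equalsForm(a, b) == trueForm(a, b);
    }
}
```

Because the predicates coincide, classifying these calls as distinct assertion types separates them by surface syntax rather than by checking power, which is the point made above.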
Therefore, the way we classify assertions does not partition them into disjoint sets. Assertions should be classified according to a more fine-grained methodology to measure their impact on test effectiveness.

5.5 Distribution of Mutants

Finding 9 shows that the number of mutants generated by different mutation operators is not evenly distributed. The distributions are strongly correlated with the distribution of the operators' mutable locations in the source code. Finding 15 tells us that the mutants generated by five of the seven mutation operators are strongly correlated with program statement types. In other words, the distribution of mutants is strongly correlated with the code characteristics of a program. Therefore, we believe that, in the context of mutation testing, researchers and testers should always consider (and report) the characteristics of their subjects, such as the distribution of different statement types.

5.6 Mutation Selection

Findings 10 and 11 indicate that there is always a significant difference in how easily mutants generated by different operators can be killed, and that omitting the Return Values Mutator always influences the measurement of test suite effectiveness. Several factors determine whether omitting a mutation operator causes a significant change in measured test suite effectiveness. One is whether the mutants generated by the operator are particularly easy or hard to detect compared to all mutants generated for the subject. For example, as suggested by finding 14, mutants generated by the Return Values Mutator are easy to detect since most of them are in return statements. In addition, the total number of mutants generated by an operator also matters: omitting a small number of mutants may not be enough to reveal a difference. These two factors should therefore be carefully considered before omitting any mutation operator when assessing test suite effectiveness through mutation testing.
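A minimal, hypothetical sketch of the first factor: a return-value mutant on an int-returning method is observed by any test that asserts on the returned value at all, which is what makes such mutants cheap to kill. The perturbation shown here (adding 1) is illustrative only; PIT's actual Return Values Mutator applies type-specific replacement rules.

```java
// Hypothetical sketch of a return-value mutant and what it takes to kill it.
public class ReturnValueMutantSketch {

    static int sizeOriginal(int[] a) {
        return a.length;
    }

    // A return-value mutant: the returned value is perturbed.
    static int sizeMutant(int[] a) {
        return a.length + 1;
    }

    // Any test asserting on the return value distinguishes the two versions,
    // so the mutant is killed without needing a specially targeted input.
    static boolean mutantKilled(int[] input, int expected) {
        boolean originalPasses = sizeOriginal(input) == expected;
        boolean mutantPasses = sizeMutant(input) == expected;
        return originalPasses && !mutantPasses;
    }
}
```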
For example, if a large portion of the program consists of return statements, the Return Values Mutator is likely to generate a large number of easy-to-detect mutants in return statements, and omitting the operator is then likely to influence the measured test suite effectiveness.

5.7 Statement Type

From findings 13 and 14 we learn that the type of statement in which a program is mutated significantly influences how easily a mutant is detected. Mutants in return statements are the easiest to kill. We believe this is because return statements are where values propagate and are passed between methods; since tests target the return statements of methods, errors in return statements are the easiest to reveal. Mutants in conditional statements are harder to detect than those in return statements: faults in conditional statements can change the control flow of a program and therefore require a specific test to be revealed. From finding 12, PIT is more likely to mutate return and conditional statements than normal statements. Therefore, PIT may overestimate a test suite's effectiveness, since mutants in normal statements are the hardest to detect. This is important to consider when measuring test suite effectiveness.

5.8 Threats to Validity

Internal validity: To conduct the controlled experiments, we made use of several existing tools, such as PIT [10], Clover [3], and JavaSlicer [7]. We assumed these tools produce valid results, so any erroneous behaviour of these tools might introduce unknown factors into the validity of our results. To mitigate such factors as much as possible, we tested our own code that uses these external tools.

Similar to previous studies [28], we treated mutants that cannot be detected by the master test suites as equivalent mutants, which might overestimate the number of equivalent mutants.
However, since we are mainly concerned with the correlations between the mutation/explicit mutation scores and the other metrics, subtracting a constant value from the total number of mutants generated will not impact the correlations.

External validity: We studied the relationship between assertions and test suite effectiveness using more than 24,000 assertions collected from five open-source Java programs. However, programs written in Java may not be representative of programs written in other languages, so our results might not extend to other languages. Moreover, the assertions we examined in this thesis are JUnit 4 assertions, and our results may not apply to assertions used in other testing frameworks. We mainly looked at the 9,177 assertions of JFreeChart [8] when comparing the effectiveness of different assertion types; although this set of assertions is large and written by real developers, our findings may not generalize to other programs. In addition, we used PIT to conduct mutation testing, and PIT stops executing once a test assertion detects a mutant; when studying assertion types, however, it would be helpful to know all the assertions that would fail. We mainly looked at mutants generated by the seven default mutation operators of PIT. Although the generated mutants cover all types of program statements (return, conditional, and normal statements), they may not be representative of all existing types of program faults. We used Randoop to generate test cases with assertions, which were compared to human-written assertions. However, test oracle generation strategies other than feedback-directed random test generation exist, and using a different test generation strategy might influence the results. We chose Randoop because of its relative ease of use.

Construct validity: The mutants we used in this thesis are generated using PIT [10].
When assessing the distribution of different types of faults and of faults in different types of statements, the implementation decisions of the tool can influence what we observe, and utilizing a different mutation testing tool may change the results. We used PIT because it is the most popular Java mutation testing tool and has been widely adopted in the literature [26, 28, 37, 39, 44].

Our empirical data as well as the five subject programs are all available online, making our study repeatable.

Chapter 6
Conclusions and Future Work

In this thesis, we studied the influence of assertions and mutants on measuring test suite effectiveness. First, we examined the correlation between assertion quantity and effectiveness, and further analyzed the influence of assertion quantity on the correlation between test suite size and effectiveness. Second, we investigated the relationship between assertion coverage and suite effectiveness, and explored the impact of assertion coverage on the relation between statement coverage and effectiveness. Third, we compared the effectiveness of different assertion characteristics. Fourth, we investigated different types of faults (simulated by mutants generated by different mutation operators): we explored the distribution of mutants generated by different mutation operators, compared different types of mutants in terms of test suite effectiveness, and studied the influence of omitting each single mutation operator separately. Finally, we investigated faults located in different types of program statements: we observed the distribution of mutants generated in different types of program statements, compared them in terms of test effectiveness, studied the relationship between statement types and operators, and examined the influence of the nesting depth of program statements on effectiveness.
Based on an analysis of over 24,000 assertions collected from five cross-domain, real-world Java programs, we found that:

• There is a very strong correlation between the number of assertions and test suite effectiveness, with or without controlling for the number of test methods in the test suite. Thus, the number of assertions in a test suite can significantly influence the predictive power of test suite size for effectiveness.

• There is a very strong correlation between assertion coverage and test suite effectiveness. With assertion coverage controlled for, there is a moderate to strong correlation between statement coverage and mutation score, and only a weak to moderate correlation between statement coverage and explicit mutation score. Therefore, statement coverage is an adequate metric of test suite effectiveness only under the assumption that the faults to be detected are not exclusively explicitly detectable, while assertion coverage is a good estimator of test suite effectiveness without such a constraint.

• Types of assertions can influence the effectiveness of their containing test suites: human-written assertions are more effective than generated assertions.

• There is no evidence that assertion content types and assertion method types significantly influence test suite effectiveness.

• There is a significant difference between mutants generated by different mutation operators in terms of test effectiveness. The distribution of different types of mutants is strongly correlated with the distribution of the operators' mutable locations. Omitting a single mutation operator does not always significantly influence test suite effectiveness.
From our observations, omitting the Return Values Mutator always leads to a loss in test suite effectiveness, while omitting the Increments Mutator or the Invert Negatives Mutator does not introduce a significant difference in test suite effectiveness.

• There is a significant difference between mutants located in different types of statements in terms of how easily they can be killed. We found that the distribution of mutants across different types of statements strongly correlates with test suite effectiveness. Mutants generated in return statements are the easiest to detect, and mutants generated in conditional statements are easier to detect than mutants generated in normal statements. Five of the seven mutation operators studied in this thesis are correlated with mutation location.

Our results indicate that it might be sufficient to use assertion quantity and assertion coverage as criteria to measure a suite's adequacy, since these two metrics are at least as good as suite size and statement coverage. In addition, fault type and fault location significantly influence the measurement of test suite effectiveness. To precisely assess a test suite's effectiveness, it is essential to describe the set of faults used by indicating the distribution of fault types and locations.

For future work, we would like to conduct experiments using more programs to further validate our findings. We plan to include more mutation operators in future studies to improve the generality of our findings. We did not find a significant difference between different assertion types, since different types of assertions can be difficult to distinguish from one another. To obtain a more fine-grained analysis, assertions should be classified more precisely into disjoint groups; for example, assertEquals(A, B) and assertTrue(A.equals(B)) should always be classified as one type instead of two.
Moreover, we will develop a taxonomy of assertions to more precisely differentiate their characteristics.

Bibliography

[1] Apache Commons Lang. http://commons.apache.org/proper/commons-lang/. Accessed: 2014-10-30.
[2] Asterisk-Java. https://blogs.reucon.com/asterisk-java/. Accessed: 2014-10-30.
[3] Clover. https://www.atlassian.com/software/clover/overview/. Accessed: 2014-10-01.
[4] ErrorCollector. http://junit.org/junit4/javadoc/4.12/org/junit/rules/ErrorCollector.html. Accessed: 2015-09-30.
[5] GitHub. https://github.com/. Accessed: 2014-09-05.
[6] JavaParser. https://code.google.com/p/javaparser/. Accessed: 2014-09-05.
[7] JavaSlicer. https://www.st.cs.uni-saarland.de/javaslicer/. Accessed: 2014-10-15.
[8] JFreeChart. http://www.jfree.org/jfreechart/. Accessed: 2014-09-13.
[9] Lambdaj. https://code.google.com/p/lambdaj/. Accessed: 2014-10-30.
[10] PIT. http://pitest.org. Accessed: 2014-10-07.
[11] SLOCCount. http://www.dwheeler.com/sloccount/. Accessed: 2014-12-15.
[12] Urban Airship Java library. http://docs.urbanairship.com/reference/libraries/java/. Accessed: 2014-10-30.
[13] Paul Ammann, Marcio Eduardo Delamaro, and Jeff Offutt. Establishing theoretical minimal sets of mutants. In Proceedings of the 2014 IEEE International Conference on Software Testing, Verification, and Validation, ICST '14, pages 21–30. IEEE Computer Society, 2014.
[14] J. H. Andrews, L. C. Briand, and Y. Labiche. Is mutation an appropriate tool for testing experiments? In Proceedings of the International Conference on Software Engineering, ICSE, pages 402–411. ACM, 2005.
[15] James H. Andrews, Lionel C. Briand, Yvan Labiche, and Akbar Siami Namin. Using mutation analysis for assessing and comparing testing coverage criteria. IEEE Trans. Softw. Eng., 32:608–624, 2006.
[16] Ellen Francine Barbosa, José Carlos Maldonado, and Auri Marcelo Rizzo Vincenzi. Toward the determination of sufficient mutant operators for C.
Software Testing, Verification and Reliability, 11(2), 2001.
[17] Lionel Briand and Dietmar Pfahl. Using simulation for assessing the real impact of test coverage on defect coverage. In Proceedings of the International Symposium on Software Reliability Engineering, ISSRE, pages 148–157. IEEE Computer Society, 1999.
[18] Xia Cai and Michael R. Lyu. The effect of code coverage on fault detection under different testing profiles. In Proceedings of the International Workshop on Advances in Model-based Testing, A-MOST, pages 1–7. ACM, 2005.
[19] Murial Daran and Pascale Thévenod-Fosse. Software error analysis: A real case study involving real faults and mutations. In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA, pages 158–171. ACM, 1996.
[20] Marcio Eduardo Delamaro, Lin Deng, Vinicius Humberto Serapilha Durelli, Nan Li, and Jeff Offutt. Experimental evaluation of SDL and one-op mutation for C. In Proceedings of the 2014 IEEE International Conference on Software Testing, Verification, and Validation, ICST '14, pages 203–212. IEEE Computer Society, 2014.
[21] Marcio Eduardo Delamaro, Lin Deng, Nan Li, Vinicius Durelli, and Jeff Offutt. Growing a reduced set of mutation operators. In Proceedings of the 2014 Ninth International Conference on Availability, Reliability and Security, ARES '14, pages 81–90. IEEE Computer Society, 2014.
[22] Lin Deng, Jeff Offutt, and Nan Li. Empirical evaluation of the statement deletion mutation operator. In Proceedings of the 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation, ICST '13, pages 84–93. IEEE Computer Society, 2013.
[23] P. G. Frankl and S. N. Weiss. An experimental comparison of the effectiveness of branch testing and data flow testing. IEEE Trans. Softw. Eng., 19:774–787, 1993.
[24] Phyllis G. Frankl and Oleg Iakounenko. Further empirical studies of test effectiveness.
In Proceedings of the 6th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE, pages 153–162. ACM, 1998.
[25] Phyllis G. Frankl and Stewart N. Weiss. An experimental comparison of the effectiveness of the all-uses and all-edges adequacy criteria. In Proceedings of the Symposium on Testing, Analysis, and Verification, TAV4, pages 154–164. ACM, 1991.
[26] Rahul Gopinath, Carlos Jensen, and Alex Groce. Code coverage for suite evaluation by developers. In Proceedings of the International Conference on Software Engineering, ICSE, pages 72–82. ACM, 2014.
[27] Dan Hao, Lu Zhang, Xingxia Wu, Hong Mei, and Gregg Rothermel. On-demand test suite reduction. In Proceedings of the 34th International Conference on Software Engineering, ICSE '12, pages 738–748. IEEE Press, 2012.
[28] Laura Inozemtseva and Reid Holmes. Coverage is not strongly correlated with test suite effectiveness. In Proceedings of the International Conference on Software Engineering, ICSE, pages 435–445. ACM, 2014.
[29] René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and Gordon Fraser. Are mutants a valid substitute for real faults in software testing? In Proceedings of the ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE, pages 654–665. ACM, 2014.
[30] Elfurjani S. Mresa and Leonardo Bottaci. Efficiency of mutation operators and selective mutation strategies: An empirical study. Software Testing, Verification and Reliability, 9:205–232, 1999.
[31] Akbar Siami Namin and James H. Andrews. Finding sufficient mutation operators via variable reduction. In Proceedings of the Second Workshop on Mutation Analysis, MUTATION '06, pages 5–. IEEE Computer Society, 2006.
[32] Akbar Siami Namin and James H. Andrews. On sufficiency of mutants. In Companion to the Proceedings of the 29th International Conference on Software Engineering, ICSE COMPANION '07, pages 73–74. IEEE Computer Society, 2007.
[33] Akbar Siami Namin and James H. Andrews.
The influence of size and coverage on test suite effectiveness. In Proceedings of the International Symposium on Software Testing and Analysis, ISSTA, pages 57–68. ACM, 2009.
[34] A. Jefferson Offutt, Ammei Lee, Gregg Rothermel, Roland H. Untch, and Christian Zapf. An experimental determination of sufficient mutant operators. ACM Trans. Softw. Eng. Methodol., 5:99–118, 1996.
[35] A. Jefferson Offutt, Gregg Rothermel, and Christian Zapf. An experimental evaluation of selective mutation. In Proceedings of the 15th International Conference on Software Engineering, ICSE '93, pages 100–107. IEEE Computer Society Press, 1993.
[36] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. Feedback-directed random test generation. In Proceedings of the International Conference on Software Engineering, ICSE, pages 75–84. IEEE Computer Society, 2007.
[37] Sebastiano Panichella, Annibale Panichella, Moritz Beller, Andy Zaidman, and Harald C. Gall. The impact of test case summaries on bug fixing performance: An empirical investigation. In Proceedings of the 38th International Conference on Software Engineering, ICSE '16, pages 547–558. ACM, 2016.
[38] David Schuler and Andreas Zeller. Assessing oracle quality with checked coverage. In Proceedings of the International Conference on Software Testing, Verification and Validation, ICST, pages 90–99. IEEE Computer Society, 2011.
[39] August Shi, Tifany Yung, Alex Gyori, and Darko Marinov. Comparing and combining test-suite reduction and regression test selection. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 237–247. ACM, 2015.
[40] Akbar Siami Namin, James H. Andrews, and Duncan J. Murdoch. Sufficient mutation operators for measuring test effectiveness. In Proceedings of the 30th International Conference on Software Engineering, ICSE '08, pages 351–360. ACM, 2008.
[41] Roland H. Untch. On reduced neighborhood mutation analysis using a single mutagenic operator.
In Proceedings of the 47th Annual Southeast Regional Conference, ACM-SE 47, pages 71:1–71:4. ACM, 2009.
[42] W. Eric Wong and Aditya P. Mathur. Reducing the cost of mutation testing: An empirical study. J. Syst. Softw., 31:185–196, 1995.
[43] Lu Zhang, Shan-Shan Hou, Jun-Jue Hu, Tao Xie, and Hong Mei. Is operator-based mutant selection superior to random mutant selection? In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE '10, pages 435–444. ACM, 2010.
[44] Yucheng Zhang and Ali Mesbah. Assertions are strongly correlated with test suite effectiveness. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE, pages 214–224. ACM, 2015.
