If Journals Embraced Conditional Equivalence Testing, Would Research be Better?

by

Harlan Campbell

B.Sc. McGill University, 2008
M.Sc. Simon Fraser University, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Statistics)

The University of British Columbia
(Vancouver)

June 2019

© Harlan Campbell, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

If Journals Embraced Conditional Equivalence Testing, Would Research be Better?

submitted by Harlan Campbell in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics.

Examining Committee:

Dr. Paul Gustafson, Statistics
Supervisor

Dr. Will Welch, Statistics
Supervisory Committee Member

Dr. John Petkau, Statistics
Supervisory Committee Member

Dr. Lang Wu, Statistics
University Examiner

Dr. Victoria Savalei, Psychology
University Examiner

Dr. Blake McShane, Kellogg School of Management, Northwestern University
External Examiner

Abstract

We consider the reliability of published science: the probability that scientific claims put forth are true. Low reliability within many scientific fields is of major concern for researchers, scientific journals and the public at large. In the first part of this thesis, we introduce a publication policy that incorporates “conditional equivalence testing” (CET), a two-stage testing scheme in which standard null-hypothesis significance testing is followed, if the null hypothesis is not rejected, by testing for equivalence. The idea of CET has the potential to address recent concerns about reproducibility and the limited publication of null results.
We detail the implementation of CET, investigate similarities with a Bayesian testing scheme, and outline the basis for how a scientific journal could proceed to reduce publication bias while remaining relevant.

In the second part of this thesis, we consider proposals to adopt measures of “greater statistical stringency,” including suggestions to require larger sample sizes and to lower the highly criticized “p < 0.05” significance threshold. While pros and cons are vigorously debated, there has been little to no modeling of how adopting these measures might affect what type of science is published. We develop a novel model that, given current incentives to publish, predicts a researcher’s most rational use of resources in terms of the number of studies to undertake, the statistical power to devote to each study, and the desirable pre-study odds to pursue. Using this model, we investigate the merits of adopting measures of “greater statistical stringency” with the goal of informing the ongoing debate. We also use this model to investigate the merits of alternative publication policies, including the registered reports policy and our novel CET publication policy.

Lay Summary

Motivated by recent concerns with the reproducibility and reliability of scientific research, we introduce a publication policy that incorporates “conditional equivalence testing” (CET), a two-stage testing scheme that combines standard null hypothesis significance testing and equivalence testing. We explain how such a policy could address issues of publication bias. We then develop a model that, given current incentives to publish, predicts a researcher’s most rational use of resources. Using this model, we are able to determine whether a given policy, such as our CET policy, can incentivize more reliable and reproducible research.
We conclude that novel publication policies, such as the Registered Reports policy and our proposed CET policy, have the potential to better align scientists’ incentives with the goal of publishing reliable science.

Preface

This thesis was completed under the supervision of Professor Paul Gustafson at the University of British Columbia. The research ideas were conceived and developed during my time studying on UBC’s Point Grey campus, and also during many sunny afternoons on beautiful Saltspring Island and many snowy days in the West Kootenay town of Rossland.

A version of Chapter 2 has been published in the journal PloS One [Campbell, H., Gustafson, P. (2018). Conditional equivalence testing: An alternative remedy for publication bias. PloS one, 13(4), e0195145]. As first author, I am responsible for the conceptualization of the main ideas, a review of the relevant literature, the simulation methodology, and writing the original draft manuscript and subsequent revisions. Dr Gustafson supervised the project, helped further develop the main ideas and methods, and contributed to writing and editing. This work was presented by me at an invited session during the Statistical Society of Canada meeting at McGill University in 2018.

A version of Chapter 3 has been published in a special issue of the peer-reviewed journal The American Statistician [Campbell, H., Gustafson, P. (2019). The World of Research Has Gone Berserk: Modeling the Consequences of Requiring “Greater Statistical Stringency” for Scientific Publication. The American Statistician, 73(sup1), 358-373].
As first author, I am responsible for the conceptualization of the main ideas, a review of the relevant literature, developing the methodology, and writing the original draft manuscript and subsequent revisions. Dr Gustafson supervised the project, worked to refine the proposed methods, and contributed to writing and editing.

I am the lead author for the work presented in Chapter 4 and am responsible for all concept formation, analysis, and manuscript composition. Dr Gustafson was the supervisory author for this work and was involved throughout.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Symbols
Glossary
Acknowledgments
1 Introduction
2 A Conditional Equivalence Testing Publication Policy
  2.1 An overview of Conditional Equivalence Testing
    2.1.1 Defining the equivalence margin
  2.2 CET operating characteristics and sample size calculations
  2.3 A comparison with Bayesian testing
    2.3.1 Simulation Study
  2.4 A CET publication policy
    2.4.1 Outline of a CET-based publication policy
  2.5 Conclusion
3 The world of research has gone berserk: modeling the consequences of requiring “greater statistical stringency” for scientific publication
  3.1 Methods
    3.1.1 Statistical power
    3.1.2 Pre-study probability
    3.1.3 Model framework
    3.1.4 Ecosystem metrics
    3.1.5 Model calibration
  3.2 The effects of adopting lower significance thresholds
    3.2.1 Results
  3.3 The effects of strict a priori power calculation requirements
    3.3.1 Results
    3.3.2 In tandem: The effects of adopting both a lower significance threshold and a power requirement
  3.4 Conclusion
  3.5 Chapter 3 Appendix
    3.5.1 Distributions: further details
    3.5.2 Ecosystem metrics: further details
4 To make more published research true, should scientists adopt alternatives to null hypothesis significance testing?
    4.0.1 Average sample size of published studies:
  4.1 A model for Registered Reports
    4.1.1 The RR policy relative to the SP0 policy
    4.1.2 The RR policy relative to SP1 and SP2 - The effects of adopting a RR policy relative to adopting policies of greater statistical stringency
    4.1.3 Concluding thoughts on RR
  4.2 A model for Bayesian Registered Reports
    4.2.1 The BF-RR policy relative to SP0
    4.2.2 The BF-RR policy relative to SP1 and SP2
    4.2.3 Concluding thoughts on Bayesian RR
  4.3 A model for the Conditional Equivalence Testing policy
    4.3.1 A standard CET policy
    4.3.2 A CET policy with a more generous equivalence margin
    4.3.3 A CET policy with a lower α1 significance threshold
    4.3.4 A CET policy with a SSR
    4.3.5 Concluding thoughts on the CET policy
  4.4 Conclusion
  4.5 Chapter 4 - Additional figures
5 Conclusion
  5.1 Future research
  5.2 Limitations of the optimality model
Bibliography

List of Tables

Table 3.1 Equations for the expected number of studies (out of a total of nS studies) for each of the eight categories considered.
Table 3.2 For ecosystems defined by fixed α = 0.05, A = (1 − psp)^m, and B = 0, and varying values of k, m, and δ, the table lists estimates for the metrics of interest.
Table 3.3 For ecosystems defined by fixed α = 0.05, A = (1 − psp)^m, and B = 0, and varying values of k, m, and δ, the table lists estimates for the metrics of interest related to the accuracy of the published effect sizes.
Table 3.4 A comparison between ecosystems with α = 0.05 and ecosystems with α = 0.005.
Table 3.5 A comparison between ecosystems with a sample size requirement and ecosystems without.
Table 3.6 A comparison between ecosystems with both a sample size requirement and α = 0.005 and ecosystems without a sample size requirement and α = 0.05.
Table 4.1 Equations for the expected number of studies (out of a total of nS studies) for each of the twelve categories considered.
Table 4.2 For ecosystems defined by a RR policy and varying values of k, m, and δ, the table lists estimates for the metrics of interest.
Table 4.3 A comparison between ecosystems with a RR policy and ecosystems with the SP0 policy.
Table 4.4 A comparison between ecosystems with a RR policy and ecosystems with the SP1 policy.
Table 4.5 A comparison between ecosystems with a RR policy and ecosystems with the SP2 policy.
Table 4.6 A comparison between ecosystems with a BF-RR policy and ecosystems with the SP0 policy.
Table 4.7 A comparison between ecosystems with a BF-RR policy and ecosystems with the SP1 policy.
Table 4.8 A comparison between ecosystems with a BF-RR policy and ecosystems with the SP2 policy.
Table 4.9 A comparison between ecosystems with the CET1 policy and ecosystems with the RR policy.
Table 4.10 For ecosystems defined by the CET1 policy and varying values of k, m, and δ, the table lists estimates for the metrics of interest.
Table 4.11 Metrics of interest for a CET policy with a more generous equivalence margin.
Table 4.12 Metrics of interest for a CET policy with a lower α1 significance threshold.
Table 4.13 Metrics of interest for a CET policy with a sample size requirement.
Table 4.14 Metrics of interest to compare the CET-SSR compared to the RR policy.

List of Figures

Figure 2.1 CET point estimates, confidence intervals, and the corresponding rejection regions.
Figure 2.2 Comparing NHST p-values with pCET values across a wide range of effect sizes for various sample sizes.
Figure 2.3 The required sample size for 90% power, or Pr(Success) = 61%.
Figure 2.4 A comparison between the JZS-Bayes Factor and the pCET.
Figure 2.5 Simulation study results: conclusions, with BF threshold of 3:1.
Figure 2.6 Simulation study results: levels of agreement, for BF threshold of 3:1.
Figure 2.7 Simulation study results: conclusions, with BF threshold of 6:1.
Figure 2.8 Simulation study results: levels of agreement, BF threshold of 6:1.
Figure 2.9 Outline of the RR and CET publication policies.
Figure 3.1 Nine gridded squares represent nine different research strategies.
Figure 3.2 The expected number of publications, ENP, for different values of PSP and PWR.
Figure 3.3 The density of published papers, fPUB(psp, pwr), for the four publication policies of interest.
Figure 3.4 Values of DSCV (Breakthrough Discoveries) for varying values of k, m and α.
Figure 3.5 Values of µ̂dnegPUB (the proportion of published effect sizes that are negative) for varying values of k, m and α.
Figure 3.6 Values of |µ̂d|PUB (the average of the absolute published effect size) for varying values of k, m and α.
Figure 3.7 Values of REL (the reliability) for varying values of k, m and α.
Figure 3.8 Values of PubR (the publication rate) for varying values of k, m and α.
Figure 3.9 Values of DREL (the reliability of breakthrough discoveries) for varying values of k, m and α.
Figure 3.10 Values of MpwrPUB (the median power of published studies) for varying values of k, m and α.
Figure 3.11 Values of IQpspPUB (the interquartile range of the distribution of psp values amongst published studies) for varying values of k, m and α.
Figure 3.12 Values of the expected Type M error for varying values of k, m and α.
Figure 3.13 Values of the Type S error rate for varying values of k, m and α.
Figure 4.1 The expected number of publications, ENP, for different values of psp and pwr under a RR policy.
Figure 4.2 The expected number of publications, ENP, for different values of psp and pwr under a CET policy with δ = 0.20, Δ = 0.20, α1 = 0.05.
Figure 4.3 The number of expected publications, ENP, for different values of PSP and PWR under a CET policy with µd = 0.20, Δ = 0.30, α1 = 0.05.
Figure 4.4 The number of expected publications, ENP, for different values of PSP and PWR under a CET policy with µd = 0.20, Δ = 0.20, α1 = 0.005.
Figure 4.5 The number of expected publications, ENP, for different values of PSP and PWR under a CET policy with δ = 0.20, Δ = 0.10, α1 = 0.05 and a SSR.
Figure 4.6 The CET and CET-SSR policies are highly sensitive to the chosen width of the equivalence margin.
Figure 4.7 Values of MpwrPUB for CET policy with varying values of k, m, Δ, α1 and δ.
Figure 4.8 Values of DSCV for CET policy with varying values of k, m, Δ, α1 and δ.
Figure 4.9 Values of PubR for CET policy with varying values of k, m, Δ, α1 and δ.
Figure 4.10 Values of OR for CET policy with varying values of k, m, Δ, α1 and δ.
Figure 4.11 Values of PREL for CET policy with varying values of k, m, Δ, α1 and δ.
Figure 4.12 Values of DREL for CET policy with varying values of k, m, Δ, α1 and δ.
Figure 4.13 Values of avgSSPUB for CET policy with varying values of k, m, Δ, α1 and δ.
Figure 4.14 Values of JCR for CET policy with varying values of k, m, Δ, α1 and δ.
Figure 4.15 A comparison between CET ecosystems with the RR policy, with regards to the OR metric.
Figure 4.16 A comparison between CET ecosystems with the RR policy, with regards to the PREL metric.
Figure 4.17 A comparison between CET ecosystems with the RR policy, with regards to the DSCV metric.
Figure 4.18 A comparison between CET ecosystems with the RR policy, with regards to the DREL metric.
Figure 4.19 A comparison between CET ecosystems with the RR policy, with regards to the JCR metric.
Figure 4.20 A comparison between CET ecosystems with the RR policy, with regards to the avgSSPUB metric.

List of Symbols

• A: probability of publication for a positive result
• AvgSSPUB: average sample size of published studies
• B: probability of publication for a null (or negative) result
• C: probability of publication for an inconclusive result
• c50 and c95: tuning parameters for modelling the sample size requirement
• DSCV: the breakthrough discoveries metric
• DREL: the reliability of breakthrough discoveries metric
• ENP: expected number of publications
• IQpspPUB: the interquartile range of the distribution of psp values amongst published studies
• JCR: Journal Content Ratio
• k: fixed cost of a study
• m: “novelty” parameter
• MpwrPUB: the median power of published studies
• n: the total sample size (n = n1 + n2)
• n1: the sample size for population 1
• n2: the sample size for population 2
• NPUB: the number of publications metric
• nS: the number of studies pursued
• n(pwr): the required (total) sample size to obtain a power of pwr
• OR: overall reliability
• p1: p-value obtained in Step 1 of CET
• p2: p-value obtained in Step 4 of CET
• pCET: equals p1, if result is positive; equals 1 − p2, otherwise
• PREL: reliability amongst the subset of published studies that are positive
• psp: pre-study probability of a study
• PubR: the publication rate metric
• pwr: power of a study
• REL: the reliability metric
• SSR: Sample Size Requirement
• T: an arbitrary total budget parameter
• xi1, for i = 1, ..., n1: independent observations in a random sample from population 1
• xi2, for i = 1, ..., n2: independent observations in a random sample from population 2
• α: the maximum allowable type I error (NHST)
• α1: the maximum allowable type I error (CET)
• α2: the maximum allowable type E error (CET)
• δ: effect size used in model (chosen value for µd)
• Δ: a positive number equal to half the length of the equivalence interval
• µ1: the true mean of population 1
• µ2: the true mean of population 2
• µd: the true difference in population means
• |µ̂d|PUB: the average of the absolute published effect size
• µ̂dnegPUB: the proportion of published effect sizes that are negative
• ρ: the equivalence margin
• σ²: the common variance of populations 1 and 2
• θ: the parameter of interest in a given testing setting

Glossary

BF Bayes Factor
CDF Cumulative Distribution Function
CET Conditional Equivalence Testing
CI Confidence Interval
ENP Expected Number of Publications
FN False Negative
FP False Positive
JZS Jeffreys-Zellner-Siow
NHST Null Hypothesis Significance Testing
OSF Open Science Framework
RCT Randomized Controlled Trial
RR Registered Reports
SSR Sample Size Requirement
TN True Negative
TOST Two One-Sided Tests Procedure
TP True Positive

Acknowledgments

This thesis would not have been possible without the support of my family, friends, professors and mentors.

Firstly, I would like to thank my supervisor, Dr Paul Gustafson. It has been an honour and a tremendous pleasure to work with you over the last four years. I am grateful for all your wisdom and advice, and for your willingness to keep an open mind allowing me to pursue my many curiosities including this work with meta-research. I must also thank Dr Darby Thompson, the most capable of statisticians I know. Many ideas in this thesis were inspired by our research and conversations together. Thank you also to Dr John Petkau and Dr Will Welch, members of my supervisory committee, for their interest and helpful feedback.

My interest in statistics and mathematics is due to many wonderful teachers whom I wish to thank. I am grateful to Dr Charmaine Dean who, during my earlier graduate work, believed in my potential so whole-heartedly. During my undergraduate degree, Dr Russel Steele and Dr James Hanley were exceptional mentors. And from much earlier days, merci à Mme Rodica Motora, an inspiring math teacher who pushed me forward with discipline and encouragement.

I am especially grateful to my parents, who always showed me love and unwavering support.
In particular, I wish to thank my mother who has shared with me her passion for learning and listens carefully to all my stories and ideas; and my father who, early on when I was very young, instilled in me a love for science with his celebrated enthusiasm. Finally, thank you to my beautiful wife Robin. You inspire me and encourage me every single day. Marrying you will always be the smartest thing I’ve ever done. None of this work would have been possible without your love.

Chapter 1
Introduction

It is to be remarked that the theory here given rests on the supposition that the object of the investigation is the ascertainment of truth. When an investigation is made for the purpose of attaining personal distinction, the economics of the problem are entirely different. But that seems to be well enough understood by those engaged in that sort of investigation.
Note on the Theory of the Economy of Research, Charles Sanders Peirce (1879)

Poor reliability, within many scientific fields, is a major concern for researchers, scientific journals and the public at large. In a highly cited publication, Ioannidis (2005) uses Bayes’ theorem to claim that more than half of published research findings are false. While not all agree with the extent of this conclusion (e.g., Goodman and Greenland (2007), Leek and Jager (2017)), recent large-scale efforts to reproduce published results in a number of different fields (e.g., economics, Camerer et al. (2016); psychology, Open Science Collaboration (2015); oncology, Begley and Ellis (2012)), have only raised concerns about the reliability and reproducibility of published science.
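
The Bayes’ theorem argument referred to above can be sketched numerically. The following is a minimal illustration, assuming the simplest version of the calculation (a single study, no bias term): the chance that a significant finding is true depends on the pre-study probability and the power, not only on the significance threshold. The function name and the input values are hypothetical choices made for illustration; psp and pwr follow the notation in the List of Symbols.

```python
def positive_predictive_value(psp, pwr, alpha=0.05):
    """Probability that a statistically significant finding is true.

    psp   -- pre-study probability that the tested effect is real
    pwr   -- statistical power of the study
    alpha -- significance threshold (type I error rate)
    """
    true_positives = pwr * psp             # real effects that reach significance
    false_positives = alpha * (1.0 - psp)  # null effects that reach significance
    return true_positives / (true_positives + false_positives)


# Illustrative inputs only (not estimates taken from the thesis).
low_odds_low_power = positive_predictive_value(psp=0.10, pwr=0.20)
even_odds_high_power = positive_predictive_value(psp=0.50, pwr=0.80)
print(low_odds_low_power, even_odds_high_power)
```

Under these assumed inputs, a long-shot hypothesis tested with low power yields a significant result that is more likely false than true, even though the same threshold of 0.05 is used in both cases.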
That a result may be substantially less reliable than the p-value suggests has been “underappreciated” (Goodman and Greenland, 2007), to say the least.

Among statisticians, the merits of Null Hypothesis Significance Testing (NHST) (Cumming, 2014, Hofmann, 2016, Lash, 2017, Szucs and Ioannidis, 2017a) and the p-value are once again the subject of vigorous debate (Chavalarias et al., 2016, Gelman, 2013a, Lew, 2013, Wasserstein and Lazar, 2016). Unreliable research not only reduces the credibility of science (already threatened by a disturbing prevalence of scientific misconduct; see Fanelli (2009)), but is also very costly (Freedman et al., 2015). As such, addressing the underlying issues is of “vital importance” (Spiegelhalter, 2017).

To address the “reproducibility crisis,” some have taken radical actions (e.g., Trafimow and Marks (2015)), while others have proposed adopting measures of “greater statistical stringency,” including suggestions to require larger sample sizes and to lower the highly criticized “p < 0.05” significance threshold. Finally, some journal editors have suggested innovative publication policies with the goal of reducing publication bias.

The limited publication of null results is certainly one of the most substantial factors contributing to low reliability. Whether due to the reluctance of journals to publish null results or to the reluctance of investigators to submit their null research (Dickersin et al., 1992, Reysen, 2006, Senn, 2012), the consequence is severe publication bias (Doshi et al., 2013, Franco et al., 2014). Despite repeated warnings, publication bias persists and, to a certain degree, this is understandable. Accepting a null result can be difficult, owing to the well-known fact that “absence of evidence is not evidence of absence” (Altman and Bland, 1995, Hartung et al., 1983).
As Greenwald (1975) writes: “it is inadvisable to place confidence in results that support a null hypothesis because there are too many ways (including incompetence of the researcher), other than the null hypothesis being true, for obtaining a null result.” Indeed, this is the foremost critique of NHST, that it cannot provide evidence in favour of the null hypothesis. The commonly held belief that a non-significant result showing high “retrospective power” (Zumbo and Hubley, 1998) implies support in favour of the null is problematic (Greenland, 2012, Hoenig and Heisey, 2001).

In order to address publication bias, it is often suggested (Dwan et al., 2008, Sterling et al., 1995, Suñé et al., 2013, Walster and Cleary, 1970) that publication decisions should be made without consideration of the statistical significance of results, i.e., “result-blind peer review” (Greve et al., 2013). In fact, a growing number of psychology and neuroscience journals are requiring pre-registration (Nosek et al., 2017) by means of adopting the publication policy of Registered Reports (RR) (Chambers et al., 2015), in which authors “pre-register their hypotheses and planned analyses before the collection of data” (Chambers et al., 2014). If the rationale and methods are sound, the RR journal agrees (before any data are collected) to publish the study regardless of the eventual data and outcome obtained.

Among many potential pitfalls with result-blind peer review (Findley et al., 2016), a genuine and substantial concern is that, if a journal were to adopt such a policy, it might quickly become a “dumping ground” (Greve et al., 2013) for null and ambiguous findings that do little to contribute to the advancement of science. To address these concerns, RR journals require that, for any manuscript to be accepted, authors must provide a priori (before any data are collected) sample size calculations that show statistical power of at least 90% (in some cases 80%).
This is a reasonable approach for addressing a difficult problem. Still, it is problematic for two reasons.

First, it is acknowledged that this policy will disadvantage researchers who work with “expensive techniques or who have limited resources” (Chambers et al., 2014). (For example, a RR policy may be particularly disadvantageous for early career researchers; see Allen and Mehler (2018).) While not ideal, small studies can provide definitive value and potential for learning (Sackett and Cook, 1993). Some go as far as arguing against any requirement for a priori sample size calculations (i.e., needing to show sufficient power as a requisite for publication). For example, Bacchetti (2002) writes: “If a study finds important information by blind luck instead of good planning, I still want to know the results;” see also Aycaguer and Galbán (2013). While unfortunate, the loss of potentially valuable “blind luck” results and small sample studies (Matthews, 1995) appears to be a necessary price to pay for keeping a “result-blind peer review” journal relevant. Is this too high a price? Based on simulations, Borm et al. (2009) conclude that the negative impact of publication bias does not warrant the exclusion of studies with low power.

Second, a priori power calculations are often flawed, due to the unfortunate “sample size samba” (Schulz and Grimes, 2005): the practice of “tweaking aspects of sample size calculation” (Hazra and Gogtay, 2016) in order to obtain what is affordable. Indeed, it is well known that studies advertised as having 80% power often have much less. Conducting an appropriate power calculation is often more complex than researchers anticipate (Astivia et al., 2019). Even under ideal circumstances, a priori power estimation is often “wildly optimistic” (Bland, 2009) and heavily biased due to “follow-up bias” (Albers and Lakens, 2018) and the “illusion of power” (Vasishth and Gelman, 2017).
This “illusion” (also known as “optimism bias”) occurs when the estimated effect size is based on a literature filled with overestimates (to be expected in many fields due, somewhat ironically, to publication bias). Djulbegovic et al. (2011) conduct a retrospective analysis of phase III Randomized Controlled Trials (RCTs) and conclude that optimism bias significantly contributes to inconclusive results; see also Chalmers and Matthews (2006).

What’s more, oftentimes due to unanticipated difficulties with enrolment, the actual sample size achieved is substantially lower than the target set out a priori (Chan et al., 2008). In these situations, RR requires that either the study is rejected/withdrawn for publication or a certain leeway is given under special circumstances (Chambers, 2017). Neither option is ideal. Given these difficulties with a priori power calculations, it remains to be seen to what extent the 90% power requirement will reduce the number of underpowered publications that could lead a journal to become a dreaded “dumping ground.”

An alternative proposal to address publication bias and the related issues surrounding low reliability is for researchers to adopt Bayesian testing schemes; e.g., Dienes and Mclatchie (2017), Kruschke and Liddell (2017), Wagenmakers (2007). It has been suggested that with Bayesian methods, publication bias will be mitigated “because the evidence can be measured to be just as strong either way” (Dienes, 2016).

Bayesian methods may also provide for a better understanding of the strength of evidence (Etz and Vandekerckhove, 2016). However, sample sizes will typically need to be larger than with equivalent frequentist testing in situations when there is little prior information incorporated (Zhang et al., 2011).
Furthermore, researchers in many fields remain uncomfortable with the need to define (subjective) priors and are concerned that Bayesian methods may increase “researcher degrees of freedom” (Simmons et al., 2011).

These issues may be less of a concern if studies are pre-registered, and indeed, a number of RR journals allow for a Bayesian option. At registration (before any data are collected), rather than committing to a specific sample size, researchers commit to attaining a certain Bayes Factor (BF). For example, the journals Comprehensive Results in Social Psychology and NFS Journal (the official journal of the Society of Nutrition and Food Science) require that one pledges to collect data until the BF is more than 3 (or less than 1/3) (Jonas and Cesario, 2017). The journals BMC Biology and the Journal of Cognition require, as a requisite for publication, a BF of at least 6 (or less than 1/6) (BMC Biology Editorial Board, 2017, Journal of Cognition Editorial, 2018).

This thesis presents an investigation of suggested policy responses to the so-called “crisis,” and in addition, offers a novel suggestion to consider: a publication policy based on the concept of Conditional Equivalence Testing (CET). In Chapter 2, we introduce the concept of CET, provide comparisons with Bayesian hypothesis testing, and describe our novel CET publication policy.

While the pros and cons of various publication policy recommendations are vigorously debated, there has been little to no modelling of how adopting proposed measures might affect what type of science is published.
In Chapter 3, we develop a novel model that, given current incentives to publish, predicts a researcher’s most rational use of resources in terms of the number of studies to undertake, the statistical power to devote to each study, and the desirable pre-study odds to pursue. We then develop a methodology that allows one to estimate the reliability of published research by considering a distribution of preferred research strategies. Using this approach, we investigate the merits of adopting measures of “greater statistical stringency”: requiring larger sample sizes and lowering the “p < 0.05” significance threshold.

In Chapter 4, we expand the model developed in Chapter 3 in order to evaluate three alternative publication policies: the Registered Reports policy, the Bayesian Registered Reports policy, and the CET policy introduced in Chapter 2. We conclude in Chapter 5 with a summary of our main recommendations and the limitations of our model. We also consider areas for future research.

Chapter 2
A Conditional Equivalence Testing Publication Policy

While equivalence testing is by no means a novel idea, previous attempts to introduce equivalence testing have “largely failed” (Lakens, 2017). In Section 2.1, in order to further our understanding of equivalence testing (and of conditional equivalence testing), we provide a brief overview including how to establish appropriate equivalence margins. In Section 2.2, we consider the operating characteristics of CET and develop methods for conducting sample size calculations.

One reason CET is an appealing approach is that it shares many of the properties that make Bayesian testing schemes so attractive. As such, the publication policy we put forward is somewhat similar to the RR “Bayesian option.” With CET, evidence can be measured in favour of either the alternative or the null (at least in some pragmatic sense), and as such is “compatible with a Bayesian point of view” (Ocaña i Rebull et al., 2008).
In Section 2.3, we conduct a simple simulation study to demonstrate how one will often arrive at the same conclusion whether using CET or a Bayesian testing scheme.

In Section 2.4, we outline how a publication policy, similar to RR, could be framed around CET to encourage the publication of null results. We conclude this chapter in Section 2.5 with further thoughts, context, and suggestions for future research.

2.1 An overview of Conditional Equivalence Testing

Standard equivalence testing is essentially NHST with the hypotheses reversed. For example, for a two-sample study of means, the equivalence testing null hypothesis would be a difference (beyond the specified margin) in population means, and the alternative hypothesis would be equal (within a specified margin) population means.

Conditional equivalence testing (CET) is the practice of standard NHST followed, if one fails to reject the null hypothesis, by equivalence testing. The incorporation of equivalence testing into a two-stage testing procedure has not been formally proposed elsewhere (one exception may be Hauck and Anderson (1986)). However, CET is not an altogether new way of testing. Rather it is the usage of established testing methods in a way that permits better interpretation of results. In this regard, it is similar to other proposals such as the “three-way testing” scheme proposed by Goeman et al. (2010), and Zhao (2016)’s proposal for incorporating both statistical and clinical significance into one’s testing. To illustrate, we briefly outline CET for a two-sample test of equal means (assuming equal variance). Afterwards, we discuss defining the equivalence margin.

We begin by introducing some basic notation. Let $x_{i1}$, for $i = 1, \ldots, n_1$, and $x_{i2}$, for $i = 1, \ldots, n_2$, be independent random samples from two normally distributed populations of interest with $\mu_1$, the true mean of population 1; $\mu_2$, the true mean of population 2; and $\sigma^2$, the true common population variance.
Let $n = n_1 + n_2$ and define sample means and sample variances as follows: $\bar{x}_g = \sum_{i=1}^{n_g} x_{ig}/n_g$, and $s_g^2 = \sum_{i=1}^{n_g} (x_{ig} - \bar{x}_g)^2/(n_g - 1)$, for $g = 1, 2$. Also, let $s_p = [\{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\}/(n_1 + n_2 - 2)]^{1/2}$. Under the standard null hypothesis, $H_0$, the true difference in population means, $\mu_d = \mu_1 - \mu_2$, is equal to zero. Under the standard alternative, $H_1$, we have that $\mu_d \neq 0$.

The term equivalence is not used in the strict sense that $\mu_1 = \mu_2$. Instead, equivalence in this context refers to the notion that the two means are “close enough,” i.e., their difference is within the equivalence margin, $\mu_d \in [-\Delta, \Delta]$, chosen to define a range of values considered equivalent. The value of $\Delta$ is a positive number and ideally chosen to be the “minimum clinically meaningful difference” (Greene et al., 2008, Kaul and Diamond, 2006) or as the “smallest effect size of interest” (Lakens et al., 2018b).

Let $F_{df}()$ be the Cumulative Distribution Function (CDF) of the t distribution with $df$ degrees of freedom and define the following critical t values: $t^*_{\alpha_1/2} = F^{-1}_{n-2}(1 - 0.5\alpha_1)$ (i.e., the $(1 - \alpha_1/2)$-th percentile of the t-distribution with $n - 2$ degrees of freedom) and $t^*_{\alpha_2} = F^{-1}_{n-2}(1 - \alpha_2)$ (i.e., the $(1 - \alpha_2)$-th percentile of the t-distribution with $n - 2$ degrees of freedom). As such, $\alpha_1$ is the maximum allowable type I error (e.g., $\alpha_1 = 0.05$) and $\alpha_2$ is the maximum allowable “type E” error (erroneously concluding equivalence), possibly set equal to $\alpha_1$. If a type E error is deemed less costly than a type I error, $\alpha_2$ may be set higher (e.g., $\alpha_2 = 0.10$).

CET, for a two-sample test of equal means (assuming equal variance), is the following procedure consisting of five steps:

Step 1- A two-sided, two-sample t-test for a difference of means.
Calculate the t-statistic, $t = (\bar{x}_1 - \bar{x}_2)/\{s_p(1/n_1 + 1/n_2)^{1/2}\}$, and associated p-value, $p_1 = 2 \cdot F_{n-2}(-|t|)$.

Step 2- If $|t| > t^*_{\alpha_1/2}$, then declare a positive result.
There is evidence of a statistically significant difference, p-value = $p_1$.
Otherwise, if $|t| \le t^*_{\alpha_1/2}$, proceed to Step 3.

Step 3- The Two One-Sided Tests Procedure (TOST) for equivalence of means (Meyners, 2012, Walker and Nowacki, 2011).
Calculate the two t-statistics: $t_1 = (\bar{x}_1 - \bar{x}_2 + \Delta)/\{s_p(1/n_1 + 1/n_2)^{1/2}\}$, and $t_2 = (\bar{x}_1 - \bar{x}_2 - \Delta)/\{s_p(1/n_1 + 1/n_2)^{1/2}\}$. Calculate an associated p-value, $p_2 = \max\{F_{n-2}(-t_1), F_{n-2}(t_2)\}$. Note: $p_2$ is a marginal p-value, in that it is not calculated under the assumption that $p_1 \ge \alpha_1$.

Step 4- If $t_1 > t^*_{\alpha_2}$ and $t_2 < -t^*_{\alpha_2}$, declare a negative result. There is evidence of a statistically significant equivalence (with margin $\rho = [-\Delta, \Delta]$), p-value = $p_2$.
Otherwise, proceed to Step 5.

Step 5- Declare an inconclusive result. There is insufficient evidence to support any conclusion.

For ease of explanation, let us define $p_{CET} = p_1$ if the result is positive, and $p_{CET} = 1 - p_2$ if the result is negative or inconclusive. Thus, a small value of $p_{CET}$ suggests evidence in favour of a positive result, whereas a large value of $p_{CET}$ suggests evidence in favour of a negative result. As is noted above, it is important to acknowledge that, despite the above procedure being dubbed “conditional,” $p_2$ is a marginal p-value, i.e., it is not calculated under the assumption that $p_1 \ge \alpha_1$. The interpretation of $p_2$ is the same regardless of whether it was obtained following Steps 1 and 2 or was obtained “on its own” via standard equivalence testing.

Standard NHST involves the same first and second steps and ends with an alternative Step 3 which states that if $p_1 \ge \alpha_1$, one declares an inconclusive result (i.e., “there is insufficient evidence to reject the null”). Similar to two-sided CET, one-sided CET is straightforward, making use of non-inferiority testing in Step 3.

While we specifically described CET for a two-sample test of population means, CET is a procedure applicable to any type of outcome. More generally, let $\theta$ be the parameter of interest in a given testing setting.
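To make the five-step two-sample procedure described above concrete, the following is a minimal Python sketch and not a reference implementation. The function names (`t_cdf`, `t_quantile`, `cet_two_sample`) are hypothetical, and the t distribution is handled with a simple Simpson's rule and bisection routine purely so that the example depends only on the standard library; a statistical library would normally supply these quantities.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of the central t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """t CDF by Simpson's rule on [0, |x|] (assumption: illustrative accuracy only)."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    value = 0.5 + area if x > 0 else 0.5 - area
    return min(1.0, max(0.0, value))


def t_quantile(p, df):
    """t quantile by bisection (assumes 0.5 <= p < 1 and a quantile below 100)."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def cet_two_sample(x1, x2, delta, alpha1=0.05, alpha2=0.10):
    """Return (conclusion, p_cet) following Steps 1 to 5 for equal-variance data."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    sp = math.sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / df)
    se = sp * math.sqrt(1 / n1 + 1 / n2)
    diff = mean(x1) - mean(x2)

    # Steps 1 and 2: two-sided t-test for a difference of means.
    t = diff / se
    p1 = 2 * t_cdf(-abs(t), df)
    if abs(t) > t_quantile(1 - alpha1 / 2, df):
        return "positive", p1

    # Steps 3 and 4: TOST for equivalence within [-delta, delta].
    t1 = (diff + delta) / se
    t2 = (diff - delta) / se
    p2 = max(t_cdf(-t1, df), t_cdf(t2, df))
    crit2 = t_quantile(1 - alpha2, df)
    if t1 > crit2 and t2 < -crit2:
        return "negative", 1 - p2

    # Step 5: neither the NHST nor the equivalence test rejects.
    return "inconclusive", 1 - p2
```

The function returns the conclusion together with $p_{CET}$, so a call such as `cet_two_sample(sample_a, sample_b, delta=0.5)` (with two hypothetical lists of observations) yields one of "positive", "negative" or "inconclusive". The margin `delta` must be chosen independently of the data passed in.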
CET can be described in terms of calculating Confidence Intervals (CIs) around $\hat{\theta}$, the statistic of interest. For example, $\theta$ may be defined as the difference in proportions, the hazard ratio, the risk ratio, or the slope of a linear regression model, etc. (Chen et al., 2000, da Silva et al., 2009, Dannenberg et al., 1994, Dixon and Pechmann, 2005, Hauschke et al., 1990, Wellek, 2010, Wiens and Iglewicz, 2000). (Note that in these other cases, the equivalence margin, $\rho$, may not necessarily be centred at zero, but will (generally) be a symmetric interval around $\theta_0$, the value of $\theta$ under the standard null hypothesis. In general, let $\rho = [\rho_L, \rho_U]$, with $\Delta$ equal to half the length of the interval.) Consider, more generally, CET as the following procedure using CIs:

Step 1- Calculate a $(1 - \alpha_1)\%$ CI for $\theta$.
Step 2- If this CI excludes $\theta_0$, then declare a positive result.
Otherwise, if $\theta_0$ is within the CI, proceed to Step 3.
Step 3- Calculate a $(1 - 2\alpha_2)\%$ CI for $\theta$.
Step 4- If this CI is entirely within $\rho$, declare a negative result.
Otherwise, proceed to Step 5.
Step 5- Declare an inconclusive result.
There is insufficient evidence to support any conclusion.

(Note that in Step 3 we consider a $(1 - 2\alpha_2)\%$ CI as opposed to a $(1 - \alpha_2)\%$ CI. This is due to the “intersection-union principle” (Berger et al., 1996). We are essentially performing two separate tests, hence $2\alpha_2$: one test for the upper end of the margin and another for the lower end of the margin.)

Figure 2.1 illustrates the three conclusions with their associated confidence intervals. We once again consider two-sample testing of normally distributed data as before, with $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$. Situation “a,” in which the $(1 - \alpha_1)\%$ CI is entirely outside the equivalence margin, is what Guyatt et al. (1995) call a “definitive-positive” result. The lower confidence limit of the parameter is not only larger than zero (the null value, $\theta_0$), implying a “positive” result, but also is above the $\Delta$ threshold.
Situations “b” and “c,” in which the $(1 - \alpha_1)\%$ CI excludes zero and the $(1 - 2\alpha_2)\%$ CI is within $[-\Delta, \Delta]$, are considered “positive” results but require some additional interpretation. One could describe the effect in these cases as “significant yet not meaningful” or conclude that there is evidence of a significant effect, yet the effect is likely “not minimally important.”

A “positive” result such as “d,” with a wider CI, represents a significant, albeit imprecisely estimated, effect. One could conclude that additional studies, with larger sample sizes, are required to determine if the effect is of a meaningful magnitude.

Evidently, in some cases, the three categories are not sufficient on their own for adequate interpretation. For example, without any additional information the “positive” vs “negative” distinction between cases “b” and “f” is incomplete and potentially misleading. These two results appear very similar: in both cases a substantial (i.e., greater than $\Delta$-sized) effect can be ruled out. Indeed, positive result “b” appears much more like the negative result “f” than positive result “a.” Additional language and careful attention to the estimated effect size are required for correct interpretation. While case “k” has a similar point estimate to “b” and “f,” the wider CI (the equivalence margin is within the $(1 - 2\alpha_2)\%$ CI) means that one cannot rule out the possibility of a meaningful effect, and so it is rightly categorized as “inconclusive.”

It may be argued that CET is simply a recalibration of the standard confidence interval, like converting Fahrenheit to Celsius.
This is valid commentary and in response, it should be noted that our suggestion to adopt CET (in the place of standard confidence intervals) is not unlike suggestions that confidence intervals should replace p-values (Cumming, 2008, Gardner and Altman, 1986, Reichardt and Gollob, 2016).

One advantage of CET over confidence intervals is that it may improve the interpretation of null results (Hauck and Anderson, 1986, Parkhurst, 2001). By clearly distinguishing between what is a negative versus an inconclusive result, CET serves to simplify the long “series of searching questions” necessary to evaluate a “failed outcome” (Pocock and Stone, 2016). However, as can be seen with the examples of Figure 2.1, the use of CET should not rule out the complementary use of confidence intervals. Indeed, a result is best interpreted when using both tools together.

As Dienes and Mclatchie (2017) clearly explain, with standard NHST, one is unable to make the “three-way distinction” between the positive, inconclusive and negative (“evidence for $H_1$,” “no evidence to speak of,” and “evidence for $H_0$”). Figure 2.2 shows the distribution of standard two-tailed p-values under NHST and corresponding $p_{CET}$ values under CET for three different sample sizes. These are the result of two-sample testing of normally distributed data with $n_1 = n_2$ (assuming equal variance).

Under the standard null ($\mu_d = 0$) with NHST, one is just as likely to obtain a large p-value as one is to obtain a small p-value. Under the standard null with CET, a small $p_2$ value (or a large $p_{CET}$ value) will indicate evidence in favour of equivalence given sufficient data.

In Figure 2.2, note that the blue points that fall below the $\alpha_1 = 0.05$ threshold in the upper panels (NHST) remain unchanged in the lower panels (CET). However, blue points that fall above the $\alpha_1 = 0.05$ threshold in the upper panels (NHST) are no longer present in the lower panels (CET).
They have been replaced by black points ($= p_{CET} = 1 - p_2$) which, if near the top, suggest evidence in favour of equivalence. This treatment of the larger p-values is more conducive to the interpretation of null results, bringing to mind the thoughts of Amrhein et al. (2017) who write “[a]s long as we treat our larger p-values as unwanted children, they will continue disappearing in our file drawers, causing publication bias, which has been identified as the possibly most prevalent threat to reliability and replicability.” We also wish to note the similarity between the $p_{CET}$ statistic and the recently proposed “second-generation p-value” (Blume et al., 2018, 2019). Indeed, the $p_{CET}$ has many of the same desirable inferential properties; see Lakens and Delacre (2018) for a related comparison and discussion.

2.1.1 Defining the equivalence margin

As with standard equivalence and non-inferiority testing, defining the equivalence margin will be one of the “most difficult issues” (Hung et al., 2005) for CET. If the margin is too large, then a claim of equivalence is meaningless. If the margin is too small, then the probability of declaring equivalence will be substantially reduced (Wiens, 2002).

As stated earlier, the margin is ideally chosen as a boundary to exclude the smallest effect size of interest (Lakens et al., 2018b). However, such effect sizes can be difficult to define, and there is generally no clear consensus among stakeholders (Keefe et al., 2013). Furthermore, previously agreed-upon meaningful differences may be difficult to ascertain as they are rarely specified in protocols and published results (Djulbegovic et al., 2011).

In some fields, there are some generally accepted norms. For example, in bioavailability studies, equivalence is routinely defined (and listed by regulatory authorities) as a difference of less than 20%.
In oncology trials, a small effect size has been defined as an odds ratio or hazard ratio of 1.3 or less (Bedard et al., 2007). In ecology, a proposed equivalence region for trends in population size (the log-linear population per year regression slope) is $\rho = [-0.0346, 0.0346]$ (Dixon and Pechmann, 2005).

In cases when a specified equivalence margin may not be as clear-cut, less conventional options have been put forth (e.g., the “equivalence curve” of Hauck and Anderson (1986), and the least equivalent allowable difference of Meyners (2007)). It has been previously suggested that the margin can be defined as a function of the observed variance; for further details see Lakens (2017) and Lakens et al. (2018b). For example, in our two-sample normal case, one could define equivalence to be a difference within half the estimated standard deviation, i.e., defining $\Delta = q s_p$, with pre-specified $q = 0.5$. This is problematic.

Recall that hypotheses are statements about parameters and not about the observed data. Therefore, hypotheses are nonrandom. As such, $\Delta$ being defined as a function of the data invalidates the proposed hypotheses, $H_0 : |\mu_d| > q s_p$. A correct choice would be to define the parameter of interest to be the standardized effect size (e.g., $\mu_d/\sigma$) and then define the margin on the standardized scale; we take $H_0 : |\mu_d| > q\sigma$. While in practice, the difference between these two $H_0$ statements may seem minuscule (or may not when $\sigma$ is very hard to estimate with a small sample size), it should nevertheless be acknowledged. It is similar to the familiar yet often mistaken assumption that $\sigma$ is known and equal to $\hat{\sigma}$. Wellek (2017) notes that the distinction is “often ignored.”

For binary and time-to-event outcomes, there is an even greater number of different ways one can define the margin (Barker et al., 2001, da Silva et al., 2009, Ng, 2008, Tsou et al., 2007).
Since the choice of margin is often difficult in the best of circumstances, a retrospective choice is not ideal as there will be ample room for bias in one’s choice, regardless of how well intentioned one may be. For this reason, for equivalence and non-inferiority RCTs, it is generally expected (and we recommend, when possible with CET) that margins are to be pre-specified (Piaggio et al., 2006).

This being said, it must be noted that the validity of an equivalence test does not depend on the margin being pre-specified. Rather, the necessary requirement for a valid test is that the margin is completely independent of the observed data. Furthermore, simply because a margin has been pre-specified (and is therefore guaranteed to be independent of the data), it is not necessarily an appropriate choice; see further thoughts in Campbell and Gustafson (2018b).

2.2 CET operating characteristics and sample size calculations

It is worth noting that there is already a large literature on TOST and non-inferiority testing. See Walker and Nowacki (2011) and Meyners (2012) for overviews that cover the basics as well as more subtle issues. There are also many proposed alternatives to TOST for equivalence testing, what are known as “the emperor’s new tests” (Perlman et al., 1999). These alternatives are offered as marginally more powerful options, yet are more complex and are not widely used.

Our proposal to systematically incorporate equivalence testing into a two-stage testing procedure has not been extensively pursued elsewhere (one exception may be Hauck and Anderson (1986)) and there has not been any discussion of how such a testing procedure could facilitate publication decisions for peer-review journals. One reason for this is a poor understanding of the conditional equivalence testing strategy whereby testing traditional non-equivalence is followed conditionally (if one fails to reject the null), by testing equivalence.
In fact, whether or not such a two-stage approach is beneficial has been somewhat controversial. As the sample size is typically determined based only on the primary test, the power of the secondary equivalence (non-inferiority) test is not controlled, thereby potentially increasing the false discovery rate (Ng, 2003). Koyama and Westfall (2005) investigate and conclude that, in most situations, such concern is unwarranted. However, a better understanding of the operating characteristics of CET is clearly required.

What follows is a description of how to calculate the probabilities of obtaining each of the three conclusions (positive, negative, and inconclusive) for CET for two-sample testing of normal data with equal variance (as described in the five steps listed above in Section 2.1). Related work includes Shieh (2016) who establishes exact power calculations for TOST for equivalence of normally distributed data; da Silva et al. (2009) who review power and sample size calculations for equivalence testing of binary and time-to-event outcomes; and Zhu (2017) who considers sample size calculations for equivalence testing of Poisson and negative binomial rates.

As before, let the true population mean difference be $\mu_d = \mu_1 - \mu_2$ and the true population variance equal $\sigma^2$. Let $\sigma^* = \sigma(1/n_1 + 1/n_2)^{1/2}$ and $s^* = s_p(1/n_1 + 1/n_2)^{1/2}$. Figure 2.1 (right panel) illustrates how each of the three conclusions can be reached based on the values of $s^*$ and $\hat{\mu}_d = \bar{x}_1 - \bar{x}_2$ obtained from the data. For this plot, $n = 90$ (with $n_1 = n_2$), $\Delta = 0.5$, $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$.

The black lettered points on the right panel of Figure 2.1 correspond to the scenarios on the left panel. The interior diamond, $\Diamond$, (with corners located at $[0, 0]$, $[-\Delta/(t^*_{\alpha_2}/t^*_{\alpha_1/2} + 1),\ \Delta/(t^*_{\alpha_1/2} + t^*_{\alpha_2})]$, $[0,\ \Delta/t^*_{\alpha_2}]$, $[\Delta/(t^*_{\alpha_2}/t^*_{\alpha_1/2} + 1),\ \Delta/(t^*_{\alpha_1/2} + t^*_{\alpha_2})]$) covers the values for which a negative conclusion is obtained. A positive conclusion corresponds to when $|\bar{x}_1 - \bar{x}_2|$ is large and $s^*$ is relatively small.

Note how $\Delta$ and $\sigma$ impact the conclusion.
If the equivalence margin is sufficiently wide, one will have Pr(inconclusive) approach zero as $\sigma$ approaches zero. Indeed, the ratio of $\Delta/\sigma$ determines, to a large extent, the probability of a negative result. If the equivalence margin is sufficiently narrow and the variance relatively large (i.e., $\Delta/\sigma$ is very small), one will have Pr(negative) $\approx 0$.

Let us assume for simplicity that $n_1 = n_2$. Then, the sampling distributions of the sample mean and sample variance are well established: $\hat{\mu}_d \sim N(\mu_d, \sigma^{*2})$ and $\sigma^{-2}(n - 2)s_p^2 \sim \chi^2_{n-2}$. Therefore, given fixed values for $\mu_d$ and $\sigma^2$, we can calculate the probability of obtaining a positive result, Pr(positive).

In Figure 2.1 (right panel), Pr(positive) equals the probability of $\hat{\mu}_d$ and $s^*$ falling into either the left or right “positive” corners and is calculated (as in a usual power calculation for NHST):

$$\Pr(\text{positive}; \mu_d, \sigma) = \left(1 - F_{n-2,\, \mu_d/\sigma^*}\left(t^*_{\alpha_1/2}\right)\right) + F_{n-2,\, \mu_d/\sigma^*}\left(-t^*_{\alpha_1/2}\right) \qquad (2.1)$$

where $F_{df, ncp}(x)$ is the cdf of the non-central t distribution with $df$ degrees of freedom and non-centrality parameter $ncp$.

One can calculate the probability of obtaining a negative result, Pr(negative), as the probability of $\hat{\mu}_d$ and $s^*$ falling into the “negative” diamond, $\Diamond$. Since $\hat{\mu}_d$ and $s^*$ are independent statistics, we can write their joint density as the product of a normal probability density function, $f_N()$, and a chi-squared probability density function, $f_{\chi^2}()$. However, the resulting double integral will remain difficult to evaluate algebraically over the boundary, $\Diamond$. Therefore, the probability is best approximated numerically, for example, by Monte Carlo integration, as follows:

$$
\begin{aligned}
\Pr(\text{negative}; \mu_d, \sigma) &= \iint_{\Diamond} f_N(u; \mu_d, \sigma^{*2})\, f_{\chi^2}(v; n-2)\, du\, dv \\
&= \int_{\Diamond} \left(\Phi(h_2(v); \mu_d, \sigma^*) - \Phi(h_1(v); \mu_d, \sigma^*)\right) f_{\chi^2}(v; n-2)\, dv \\
&\approx \sum_{j=1}^{M} \left(\Phi(h_2(c_j); \mu_d, \sigma^*) - \Phi(h_1(c_j); \mu_d, \sigma^*)\right)/M
\end{aligned}
$$

where $\Phi()$ is the normal cdf and Monte Carlo draws from a chi-squared distribution provide $c_j = \{\sigma^{*2} q^{[j]}/(n - 2)\}^{1/2}$, with $q^{[j]} \sim \chi^2_{n-2}$ for $j = 1, \ldots, M$, so that each $c_j$ is a simulated value of $s^*$.
The left and right-hand boundaries of the diamond-shaped “negative region” are defined by $h_1(c_j) = \min\{0, \max(c_j t^*_{\alpha_2} - \Delta,\ -c_j t^*_{\alpha_1/2})\}$ and $h_2(c_j) = \max\{0, \min(-c_j t^*_{\alpha_2} + \Delta,\ c_j t^*_{\alpha_1/2})\}$.

Defining the boundary with $h_1()$ and $h_2()$ allows for three distinct cases as seen in Figure 2.1 (right panel):
(1) $s^* > \Delta/t^*_{\alpha_2}$, in which case $h_1(s^*) = h_2(s^*) = 0$;
(2) $\Delta/(t^*_{\alpha_1/2} + t^*_{\alpha_2}) < s^* < \Delta/t^*_{\alpha_2}$, in which case $h_1(s^*) = -\Delta + t^*_{\alpha_2}s^*$ and $h_2(s^*) = \Delta - t^*_{\alpha_2}s^*$; and
(3) $s^* < \Delta/(t^*_{\alpha_1/2} + t^*_{\alpha_2})$, in which case $h_1(s^*) = -t^*_{\alpha_1/2}s^*$ and $h_2(s^*) = +t^*_{\alpha_1/2}s^*$.

In order to determine an appropriate sample size for CET, one must replace $\mu_d$ with an a priori estimate, $\tilde{\mu}_d$, the “anticipated mean difference,” and replace $\sigma^2$ with an a priori estimate, $\tilde{\sigma}^2$, the “anticipated variance.” Then one might be interested in calculating six values: the probabilities of obtaining each of the three possible results (positive, negative and inconclusive) under two hypothetical scenarios, (1) where $\mu_d = 0$, and (2) where $\mu_d$ is equal to $\tilde{\mu}_d$, the value anticipated given results in the literature.

One might also be interested in considering the likelihood of an inconclusive finding (under both the null and alternative). Since the objective of any study should be to obtain a conclusive result, it seems reasonable to calculate sample size with the objective to maximize the likelihood of success, i.e., to minimize Pr(inconclusive).

The quantity:

$$\Pr(\text{success}) = 1 - \tfrac{1}{2}\Pr(\text{inconclusive}; \mu_d = \tilde{\mu}_d, \tilde{\sigma}^2) - \tfrac{1}{2}\Pr(\text{inconclusive}; \mu_d = 0, \tilde{\sigma}^2) \qquad (2.2)$$

represents the probability of a “successful” study under the assumption that the null ($\mu_d = 0$) and alternative (with specified $\mu_d = \tilde{\mu}_d$) are equally likely. (Note that both false positive and false negative studies in this equation are considered “successful.”) This weighted average could be considered a simple version of what is known as “assurance” (O’Hagan et al., 2005).

Figure 2.3 shows how consideration of Pr(success) as the criterion for determining sample size attenuates the effect of $\tilde{\mu}_d$ on the required sample size.
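The probability calculations of this section can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the code behind the figures: the function names are hypothetical, the critical t values come from a crude standard-library Simpson and bisection routine, and Pr(positive) is obtained by averaging over the same chi-squared draws used for Pr(negative) instead of evaluating the non-central t cdf of equation (2.1) directly (the two agree in expectation).

```python
import math
import random
from statistics import NormalDist


def t_cdf(x, df, steps=1000):
    """t CDF by Simpson's rule (assumption: illustrative accuracy only)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(p, df):
    """t quantile by bisection (assumes 0.5 <= p < 1)."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def cet_probabilities(mu_d, sigma2, n, delta, alpha1=0.05, alpha2=0.10,
                      draws=20000, seed=1):
    """Estimate (Pr(positive), Pr(negative), Pr(inconclusive)) with n1 = n2 = n/2."""
    df = n - 2
    sigma_star = math.sqrt(sigma2 * 4 / n)      # sigma * sqrt(1/n1 + 1/n2)
    crit1 = t_quantile(1 - alpha1 / 2, df)
    crit2 = t_quantile(1 - alpha2, df)
    normal = NormalDist(mu_d, sigma_star)       # sampling distribution of mu_hat_d
    rng = random.Random(seed)
    pos = neg = 0.0
    for _ in range(draws):
        q = rng.gammavariate(df / 2, 2.0)       # chi-squared draw with df d.f.
        c = sigma_star * math.sqrt(q / df)      # simulated value of s*
        h1 = min(0.0, max(c * crit2 - delta, -c * crit1))
        h2 = max(0.0, min(-c * crit2 + delta, c * crit1))
        neg += normal.cdf(h2) - normal.cdf(h1)
        pos += 1 - normal.cdf(c * crit1) + normal.cdf(-c * crit1)
    pos, neg = pos / draws, neg / draws
    return pos, neg, 1 - pos - neg


def pr_success(mu_tilde, sigma2_tilde, n, delta, **kwargs):
    """Equation (2.2): equal weight on the null and the anticipated alternative."""
    inc_alt = cet_probabilities(mu_tilde, sigma2_tilde, n, delta, **kwargs)[2]
    inc_null = cet_probabilities(0.0, sigma2_tilde, n, delta, **kwargs)[2]
    return 1 - 0.5 * inc_alt - 0.5 * inc_null
```

For planning purposes one could evaluate `pr_success` over a grid of candidate sample sizes and pick the smallest `n` that reaches a target value; the number of Monte Carlo draws and the seed are arbitrary choices made here for reproducibility.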
Suppose one calculates that the required sample size is $n = 1{,}000$ based on the desire for 90% statistical power and the belief that $\sigma^2 = 1$ and $\mu_d = 0.205$, with $n_1 = n_2$ as before. This corresponds to a 61% probability of success for $\Delta = 0.1025$ $(= \tfrac{1}{2}\mu_d)$. If the true variance is slightly larger than anticipated, $\sigma^2 = 1.25$, and the difference in means smaller, $\mu_d = 0.123$, the actual sample size needed for 90% power is in fact $n = 3{,}476$, while the actual sample size needed for Pr(success) = 61% is $n = 1{,}770$. On the other hand, if the true variance is slightly smaller than anticipated, $\sigma^2 = 0.75$, and the difference in means greater, $\mu_d = 0.33$, the actual sample size needed for 90% power is only $n = 288$, while the actual sample size needed for Pr(success) = 61% is $n = 638$. It follows that, if one has little certainty in $\mu_d$ and $\sigma^2$, calculating the required sample size with consideration of Pr(success) may be less risky.

For related work on statistical power, see Shao et al. (2008) who propose a hybrid Bayesian-frequentist approach to evaluate power for testing both superiority and non-inferiority. Jia and Lynn (2015) discuss a related sample size planning approach that considers both statistical significance and clinical significance.

Jiroutek et al. (2003) advocate that, rather than calculate statistical power (i.e., the probability of rejecting $H_0 : \theta = \theta_0$ should the alternative, $H_1$, be true), one should calculate the probability that the width of a confidence interval is less than a fixed constant and the null hypothesis is rejected, given that the confidence interval contains the true parameter. Most recently, Rothman and Greenland (2018) discuss a similar idea for planning the size of epidemiological studies.

2.3 A comparison with Bayesian testing

Recently, Bayesian methods have been advocated for as a “possible solution to publication bias” (Konijn et al., 2015).
In particular, there have been many Bayesian testing schemes proposed in the psychology literature; see Gönen (2010), Mulder and Wagenmakers (2016) for details. What’s more, publication policies based on Bayesian testing schemes are currently in use by a small number of journals and are the preferred approach for some (Dienes and Mclatchie, 2017). In response to these developments, we will compare, with regard to their operating characteristics, CET and a Bayesian testing scheme.

This comparison brings to mind Dienes (2014) who compares testing with Bayes Factors to testing with “interval methods” and notes that with interval methods, a study result is a “reflection of the data,” whereas with BFs the result reflects the “evidence of one theory over another.” What follows is a brief overview of one Bayesian scheme and an investigation of how it compares to CET. (We recognize that this scheme is based on reference priors while some advocate for weakly informative priors or elicitation of subjective priors.)

The Bayes Factor is a valuable tool for determining the relative evidence of a null model compared to an alternative model (Hoekstra et al., 2017). Consider, for the two-sample testing of normally distributed data, a Bayes Factor testing scheme in which we take the Jeffreys-Zellner-Siow (JZS) prior (recommended as a reasonable “objective prior”) for the alternative hypothesis.

Note that, for the Bayes Factor, the null hypothesis, $H_0$, corresponds to $\mu_d = 0$; and the alternative, $H_1$, corresponds to $\mu_d \neq 0$. The JZS testing scheme (see Rouder et al. (2009) for additional details) involves placing a normal prior on $\eta = (\mu_2 - \mu_1)/\sigma$, $\eta \sim \text{Normal}(0, \sigma_\eta^2)$, and placing an inverse chi-squared prior, $\sigma_\eta^2 \sim \text{inv.}\chi^2(1)$, on the hyper-parameter $\sigma_\eta$. Integrating out $\sigma_\eta$ shows that this is equivalent to having a Cauchy prior, $\eta \sim \text{Cauchy}$.
We can write the JZS-Bayes Factor in terms of the standard t-statistic, $t = (\bar{x}_1 - \bar{x}_2)/s^*$, with $n^* = n_1 n_2/(n_1 + n_2)$, and $v = n_1 + n_2 - 2$, as follows:

$$B_{01} = \frac{\left(1 + \frac{t^2}{v}\right)^{-(v+1)/2}}{\int_0^\infty (1 + n^* g)^{-1/2} \left(1 + \frac{t^2}{(1 + n^* g)v}\right)^{-(v+1)/2} (2\pi)^{-1/2} g^{-3/2} e^{-1/(2g)}\, dg} \qquad (2.3)$$

Large values greater than 1 for $B_{01}$ indicate evidence in favour of the null, whereas small values less than 1 indicate evidence in favour of the alternative.

Notice that the probability density function of an Inverse-Gamma(0.5, 0.5) random variable is:

$$g(x) = \frac{0.5^{0.5}}{\Gamma(0.5)}\, x^{-3/2} e^{-1/(2x)} \qquad (2.4)$$

Therefore, we can re-express this as $1/B_{01} = E[f(G)]$, where $G \sim \text{Inverse-Gamma}(0.5, 0.5)$, and where we define the function $f()$ as follows:

$$f(x) = (1 + n^* x)^{-1/2} \exp\left[-\frac{v+1}{2} \cdot \left(\log\left(1 + \frac{t^2}{(1 + n^* x)v}\right) - \log\left(1 + \frac{t^2}{v}\right)\right)\right] (2\pi)^{-1/2}\, \frac{\Gamma(0.5)}{0.5^{0.5}} \qquad (2.5)$$

The integral in the denominator of equation (2.3) may make $B_{01}$ difficult to evaluate algebraically. One possibility for numerical evaluation is to use the method of Monte Carlo integration, as follows:

$$1/B_{01} \approx \sum_{j=1}^{M} \{f(g^{[j]})\}/M$$

where $f()$ is as defined above in equation (2.5) and $g^{[j]}$ are independent Monte Carlo draws from an Inverse-Gamma(0.5, 0.5) distribution.

Figure 2.4 (A.) shows how the JZS-Bayes Factor changes with sample size for four different values of the observed difference in means, $\hat{\mu}_d = \bar{x}_1 - \bar{x}_2$; the observed variance is held constant at $s_p^2 = 1$. When the observed mean difference is exactly 0, the BF increases logarithmically with $n$. For small to moderate $\hat{\mu}_d$, the BF supports the null for small values of $n$. However, as $n$ becomes larger, the BF yields less support for the null and eventually favours the alternative. The horizontal lines mark the 3:1 and 1:3 thresholds (“moderate evidence”) as well as the 10:1 and 1:10 thresholds (“strong evidence”).

Figure 2.4 (C.) shows $p_{CET}$-values for the same four values of $\hat{\mu}_d$, constant $s_p^2 = 1$, and the equivalence margin of $[-0.50, 0.50]$. The curves suggest that CET possesses similar “ideal behaviour” (Rouder et al., 2009) as is observed with the BF.
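Equations (2.3) to (2.5) translate into a short Monte Carlo routine. The sketch below is illustrative only: the function name `jzs_bf01` and its arguments are hypothetical, and it relies on the fact that if $Z$ is standard normal then $1/Z^2$ follows an Inverse-Gamma(0.5, 0.5) distribution (because $Z^2$ is chi-squared with one degree of freedom). It averages $f(G)$, which estimates $1/B_{01}$, and then takes the reciprocal.

```python
import math
import random


def jzs_bf01(t, n1, n2, draws=50000, seed=1):
    """Monte Carlo estimate of the JZS Bayes factor B01 for a two-sample t-statistic."""
    n_star = n1 * n2 / (n1 + n2)
    v = n1 + n2 - 2
    log_null = math.log1p(t * t / v)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        z = rng.gauss(0.0, 1.0)
        # Assumption: 1/Z^2 is used as an Inverse-Gamma(0.5, 0.5) draw;
        # the floor on z*z only guards against division by zero.
        g = 1.0 / max(z * z, 1e-12)
        scale = 1.0 + n_star * g
        log_alt = math.log1p(t * t / (scale * v))
        # f(g) from equation (2.5); the trailing constant there equals 1.
        total += math.exp(-0.5 * math.log(scale)
                          - (v + 1) / 2 * (log_alt - log_null))
    # The average of f estimates 1/B01, so B01 is its reciprocal.
    return draws / total
```

A returned value above 3 would be read as evidence for the null and a value below 1/3 as evidence for the alternative under the 3:1 threshold; for example, `jzs_bf01(0.0, 50, 50)` corresponds to an observed mean difference of exactly zero with 50 observations per group. The number of draws and the seed are arbitrary choices.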
When the observed mean difference is exactly zero, CET provides increasing evidence in favour of equivalence with increasing $n$. For small to moderate $\hat{\mu}_d$, CET supports the null for small values of $n$ but, as $n$ becomes larger, at a certain point favours the alternative. The sharp change-point represents border cases.

Consider case “f” in Figure 2.1: if $n$ increased, the confidence intervals would shrink and at a certain point, the $(1 - \alpha_1)\%$ CI would exclude 0 (similar to case “b”). At that point, the result abruptly changes from “negative” to “positive.” While this abrupt change may appear odd at first, it may in fact be more desirable than the smooth transition of the BF. Consider for example when $\hat{\mu}_d = 0.25$. Then for $n$ between 112 and 496, the BF will be strictly above 1/3 and strictly below 3, and as such the result, by BF, is inconclusive. In contrast, for the same range in sample size, $p_{CET}$ will be either above 0.90 or below 0.05. As such, with $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$, a conclusive result is obtained. While careful interpretation is required (e.g., “the effect is significant yet not of a meaningful magnitude”), this may be preferable in some settings to the BF’s inconclusive result.

We can also consider the posterior probability of $H_0$ (i.e., $\mu_d = 0$) equal to $B_{01}/(1 + B_{01})$ (when the prior probabilities $\Pr(H_0)$ and $\Pr(H_1)$ are equal), plotted in Figure 2.4 (B.). The similarities and differences between p-values and posterior probabilities have been widely discussed previously (Berger and Delampady, 1987, Greenland and Poole, 2013, Marsman and Wagenmakers, 2017). Figure 2.4 suggests that the JZS-BF and CET testing may often result in similar conclusions. We investigate this further by means of a simple simulation study.

2.3.1 Simulation Study

We conducted a small simulation study to compare the operating characteristics of testing with the JZS-BF relative to testing with the CET approach. CET conclusions were based on setting $\Delta = 0.50$, $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$.
JZS-BF conclusions were based on a threshold of 3 or greater for evidence in favour of a negative result and less than 1/3 for evidence in favour of a positive result. BFs between 1/3 and 3 correspond to an inconclusive result. A threshold of 3:1 is often considered “substantial evidence” in Bayesian analyses (Wagenmakers et al., 2011). We also conducted the simulations with a 6:1 threshold for comparison.

Note that one advantage of Bayesian methods is that sample sizes need not be determined in advance (Rouder, 2014). With frequentist testing, interim analyses to re-evaluate one’s sample size can be performed (while maintaining the type I error rate). However, these must be carefully calibrated and ideally planned in advance (Jennison and Turnbull, 1999, Lakens, 2014). Schönbrodt and Wagenmakers (2016) list three ways one might design the sample size for a study using the BF for testing. For the simulation study here we examine only the “fixed-n design.”

For a range of $\mu_d$ (= 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, 0.67) and 14 different sample sizes (n ranging from 10 to 5,000, with $n_1 = n_2$) we simulated normally distributed two-sample datasets (with $\sigma^2 = 1$). For each dataset, we obtained CET p-values and JZS-BFs, and declared the result to be positive, negative or inconclusive accordingly. Results are presented in Figure 2.5 (and Figures 2.6, 2.7 and 2.8), based on 10,000 distinct simulated datasets per scenario.

Several findings merit comment (unless otherwise noted, comments refer to results displayed in Figure 2.5):

• In this simulation study, the JZS-BF admits a very low frequentist type I error, recorded at most ≈ 0.01, for a sample size of n = 110.
As the sample size increases, the frequentist type I error diminishes to a negligible level.

• The JZS-BF requires less data to reach a negative conclusion than the CET. However, with moderate to large sample sizes (n = 100 to 5,000) and small true mean differences ($\mu_d$ = 0 to 0.25), both methods are approximately equally likely to deliver a negative conclusion.

• While the JZS-BF requires less data to reach a conclusion when the true mean difference is small ($\mu_d$ = 0 to 0.25) (the solid black curve drops more rapidly than the dashed grey line), there are scenarios in which larger sample sizes will surprisingly reduce the likelihood of obtaining a conclusive result (the solid black curve drops abruptly then rises slightly as n increases for $\mu_d$ = 0.07, 0.09, 0.13, and 0.18).

• The JZS-BF is always less likely to deliver a positive conclusion (the dashed blue curve is always higher than the solid blue curve). In scenarios like those considered, the JZS-BF may require larger sample sizes for reaching a positive conclusion and may be considered “less powerful” in a traditional frequentist sense.

• Figure 2.6 shows the probability that, for a given dataset, both methods agree as to whether the result is “positive,” “negative,” or “inconclusive.” Across the range of effect sizes and sample sizes considered, the probability of agreement is always higher than 50%. For the strictly dichotomous choice between “positive” and “not positive,” the probability that both methods agree is above 70% for all cases considered except a few scenarios in which effect size is small ($\mu_d$ = 0.07, 0.09, 0.13, and 0.18) and sample size is large (n > 500).

• With the 6:1 BF threshold, results are similar. The main difference with this more stringent threshold is that much larger sample sizes are required in order to achieve high probabilities of obtaining a conclusive result; see Figure 2.7.
Also, agreement with CET is substantially lower; see Figure 2.8.

The results of the simulation study suggest that, in many ways, the JZS-BF and CET operate quite similarly. It is reasonable to think of JZS-BF and CET as two pragmatically similar, yet philosophically different, tools for making “trichotomous significance-testing decisions.”

2.4 A CET publication policy

Many researchers have put forth ideas for new publication policies aimed at addressing the issue of publication bias. There is a wide range of opinions on how to incorporate more null results into the published literature (Dirnagl et al., 2010, Ioannidis, 2006, Shields, 2000).

RR is one of many proposed “two-step” manuscript review schemes in which acceptance for publication is granted prior to obtaining the results, e.g., Lawlor (2007), Mell and Zietman (2014), Smulders (2013), Walster and Cleary (1970). Figure 2.9 (left panel) illustrates the RR procedure and Chambers et al. (2014) provide an in-depth explanation answering a number of frequently asked questions. The RR policy has two central components: (1) pre-registration and (2) the “RR commitment to publish.”

Pre-registration can be extremely beneficial as it reduces “researcher degrees of freedom” (Simmons et al., 2011) and prevents, to a large degree, many questionable research practices including data-dredging (Berry, 1990), the post-hoc fabrication of hypotheses (“HARKing”) (Kerr, 1998), and p-hacking (Gelman and Loken, 2013). However, on its own, pre-registration does little to prevent publication bias. This is simply because pre-registration: (a1) cannot prevent authors from disregarding negative results (Song et al., 2014); (a2) does nothing to prevent reviewers and editors from rejecting studies for lack of significance; and (a3) does not guarantee that peer reviewers consider compliance with the pre-registered analysis plan (Chan et al., 2008, Mathieu et al., 2013, van Lent et al., 2015).

Consider the field of medicine as a case study.
For over a decade, pre-registration of clinical trials has been required by major journals as a prerequisite for publication. Despite this heralded policy change, selective outcome reporting remains ever prevalent (Huić et al., 2011, Mathieu et al., 2009, Ramsey and Scoggins, 2008, Ross et al., 2009, Song et al., 2017). (This being said, new 2017/2018 guidelines for the ClinicalTrials.gov registry show much promise in addressing a1, a2, and a3; see Zarin et al. (2016).)

In order to prevent publication bias, RR complements pre-registration with a “commitment to publish.” In practice this consists of an “in principle acceptance” policy along with the policy of publishing “withdrawn registration” studies. In order to counter authors who may simply shelve negative results following pre-registration (a1), RR journals commit to publishing the abstracts of all withdrawn studies as withdrawn registration papers. By guaranteeing that, should a study follow its pre-registered protocol, it will be accepted for publication (“in principle acceptance”), RR prevents reviewers and editors from rejecting a study based on the results (a2). Finally, RR requires that a study is in strict compliance with the pre-registered protocol if it is to be published (a3). In order to keep an RR journal relevant (not simply full of inconclusive studies), RR requires, as part of registration, that a researcher commits to a sample size large enough to achieve 90% (in some cases 80%) statistical power. (In a small number of RR journals, a Bayesian alternative option is offered. Instead of committing to a specific sample size, researchers commit to achieving a certain BF. See Section 4.2 for further details.)

The policy we put forth here is not meant to be an alternative to the “pre-registration” component of RR. The benefits of pre-registration are clear, and in our view it is most often “worth the effort” (Wager and Williams, 2013). If implemented properly, pre-registration should not “stifle exploratory work” (Gelman, 2013b).
Instead, what follows is an alternative to the second “commitment to publish” component.

2.4.1 Outline of a CET-based publication policy

Figure 2.9 (right panel) illustrates the steps of our proposed policy. What follows is a general outline.

Registration - In the first stage of a CET-based policy, before any data are collected, a researcher will register the intent to conduct a study with a journal’s editor. As in the RR policy, this registration process will (1) detail the motivations, merits, and methods of the study, (2) list the defined outcomes and various hypotheses to be tested, and (3) state a target sample size, justified based on multiple grounds. However, unlike in the RR policy, failure to meet a given power threshold (e.g., failure to show calculations that give 80% power) will not generally be grounds for rejection (see Section 2.5 for further commentary on this point). Rather, registration under a CET-based policy will be based on defining an equivalence margin for each hypothesis test to be carried out. The researcher will also need to note if there are plans for any sample size reassessments, interim and/or futility analyses (common in many large clinical trials; see Choi (1997)).

Editorial and Peer Review - If the merits of the study satisfy the editorial board, the registration study plan will then be sent to reviewers to assess whether the methods for analysis are adequate and whether the equivalence margins are sufficiently narrow. Once the peer reviewers are satisfied (possibly after revisions to the registration plan), the journal will then agree, in principle, to accept the study for eventual publication, on condition that either a positive or a negative result is obtained.

Data collection - Armed with this “in principle acceptance,” the researcher will then collect data in an effort to meet the established sample size target.
Once the data are collected and analyses complete, under normal circumstances, the study will be published if and only if either a positive or negative result is obtained as defined by the pre-specified equivalence margins. (This is the incentive to use a large sample size.)

As is required practice for good reporting, any failure to meet the target sample size should be stated clearly, along with a discussion of the reasons for failure and the consequences with regards to the interpretation of results; see Toerien et al. (2009). Inconclusive results will not generally be published, thus protecting a journal from becoming a “dumping ground” for failed studies. The journal will however maintain a record of the inconclusive result and this record should be made available to the public. In very rare circumstances, it may be determined that an inconclusive study offers a substantial contribution to the field and should therefore be considered for publication.

This proposed policy is most similar to the RR Bayesian option, the Bayes Factor Registered Reports (BF-RR) policy, outlined in Section 2.3. Journals only commit (“in principle acceptance”) to publish conclusive studies (i.e., studies with a small p-value (either a small $p_1$ or a small $p_2$), or studies with a BF above or below a certain threshold) and no a priori sample size requirements are forced upon a researcher. As noted in Section 2.3, we expect that the conclusions obtained under either policy will often be the same.
The CET-based policy therefore represents a frequentist alternative to the BF-RR policy.

A journal may wish to impose stricter or weaker requirements for publication of results and this can be done by setting thresholds for $\alpha_1$ and $\alpha_2$ accordingly (e.g., stricter thresholds: $\alpha_1 = 0.01$, $\alpha_2 = 0.05$, $\Delta = 0.5$; weaker thresholds: $\alpha_1 = 0.05$, $\alpha_2 = 0.10$, $\Delta = 0.5$).

“Gaming the system” - When evaluating a new publication policy one should always ask how (and how easily) a researcher could, unintentionally or not, “game the system.” For the CET policy as described above, we see one rather obvious strategy.

Consider the following, admittedly extreme, example. A researcher submits 20 different study protocols for pre-registration, each with a very small sample size target (e.g., n = 8) and a pre-specified equivalence margin of $[-0.50, 0.50]$. Suspend any disbelief, and suppose these studies all have good merit, are all well written, and are all well designed (besides being severely underpowered), and are therefore all granted in principle acceptance. Then, in the event that the null is true (i.e., $\mu = 0$) for all 20 studies (and with $\alpha_1 = 0.05$), one study out of the 20 is expected, on average, to obtain a positive result and thus be published. This is publication bias at its worst.

In order to discourage this unfortunate practice, we suggest making a researcher’s history of pre-registered studies available to the public (e.g., using ORCID (Haak et al., 2012)). However, while potentially beneficial, we do not anticipate this type of action being necessary. With the CET policy, it is no longer in a researcher’s interest to “game the system.”

Consider once again our extreme example. The strategy has an approximately 64% chance of obtaining at least one publication at a cost of 20 submissions and a total of $n_{TOTAL} = 160 = 8 \cdot 20$ observations.
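The 64% figure can be checked with a short calculation. The sketch below assumes that the 20 studies are independent, that the null is true in each, and that with such small samples a study can only be published through a (false) positive result, which occurs with probability equal to the first-stage significance level. The function name is illustrative.

```python
def prob_at_least_one_positive(n_studies, alpha1):
    """Probability that at least one of n_studies independent true-null
    studies yields a (false) positive result at level alpha1."""
    return 1.0 - (1.0 - alpha1) ** n_studies


n_studies, alpha1, n_per_study = 20, 0.05, 8

expected_positives = n_studies * alpha1                      # average number of false positives
p_any = prob_at_least_one_positive(n_studies, alpha1)        # chance of at least one publication
total_observations = n_studies * n_per_study                 # total data collected

print(f"expected positives: {expected_positives:.2f}")
print(f"Pr(at least one positive): {p_any:.3f}")
print(f"total observations: {total_observations}")
```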
If the goal is to maximize the probability of being published (Charlton and Andras, 2006), then it is far more efficient to submit a single study with n = 160, in which case there is an approximately 98% chance of obtaining a publication (i.e., with $\alpha_2 = 0.10$, $\Pr(\text{inconclusive} \mid \mu = 0, \sigma^2 = 1, \Delta = 0.5) = 0.02$; or a 92% chance with $\alpha_2 = 0.05$, $\Pr(\text{inconclusive} \mid \mu = 0, \sigma^2 = 1, \Delta = 0.5) = 0.08$). We will further explore these dynamics in Chapter 4 when we model the incentive structures driving research under alternative publication policies.

2.5 Conclusion

The publication policy outlined here may be welcomed by journal editors, researchers and all those who wish to see more reliable science. Research journals which wish to remain relevant and gain a high impact factor should welcome the CET policy as it offers a mechanism for excluding inconclusive results while providing a space for potentially impactful negative studies. As such, embracing equivalence testing is an effective way to make publishing null results “more attractive” (O’Hara, 2011). Researchers should be pleased with a policy that provides a “conditional in principle acceptance” and does not insist on specific sample size requirements that may not be feasible or desirable.

It must be noted that publication policies need not be implemented as “blanket rules” that must apply to all papers submitted to a given journal. The RR policy, for instance, is often only one of several options offered to those submitting research to a participating journal. Furthermore, RR is not uniformly implemented across participating journals. Each journal has adapted the general RR framework to suit its objectives and the given research field. We envision similar possibilities for a CET policy in practice. Journals may choose to offer it as one of many options for submissions and will no doubt adapt the policy specifics as needed.
For example, in a field where underpowered research is of particular concern, a journal may wish to implement a modified CET publication policy that does include strict sample size/power requirements. We further discuss and investigate the merits of various configurations of the CET policy in Chapter 4.

The requirement to specify an equivalence margin prior to collecting data will have the additional benefit of forcing researchers and reviewers to think about what would represent a meaningful effect size before embarking on a given study. While there will no doubt be pressure on researchers to “p-hack” in order to meet either the $\alpha_1$ or $\alpha_2$ threshold (see Chuard et al. (2019)), this can be discouraged by insisting that an analysis strictly follows the pre-registered analysis plan. Adopting strict thresholds for significance can also act as a deterrent. Finally, we believe that the CET policy may improve the reliability of published science by not only allowing for more negative research to be published, but by modifying the incentive structure driving research (Nosek et al., 2012).

Using an optimality model, Higginson and Munafò (2016) conclude that, given current incentives, the rational strategy of a scientist is to “focus almost all of their research effort on underpowered exploratory work [... and] carry out lots of underpowered small studies to maximize their number of publications, even though this means around half will be false positives.” This result is in line with the views of many (e.g., Bakker et al. (2012), Button et al. (2013b), Gervais et al. (2015)), and provides the basis for why statistical power in many fields has not improved (Smaldino and McElreath, 2016) despite being highlighted as an issue over five decades ago (Cohen, 1962).
In Chapter 3 we develop a similar model for researcher incentives and investigate the merits of various publication policies, including, in Chapter 4, the CET policy.

Publication - Much of this chapter has been published as a peer-reviewed article in the journal PLoS ONE; see Campbell and Gustafson (2018a).

Code to reproduce all figures in this chapter - R-code is available to reproduce all figures within this chapter on the Open Science Framework (OSF) repository (DOI 10.17605/OSF.IO/UXN2Q; ARK c7605/osf.io/uxn2q).

Figure 2.1: Left panel - CET point estimates and confidence intervals. Point estimates and confidence intervals of thirteen possible results from two-sample testing of normally distributed data are presented alongside their corresponding conclusions. Let n = 90, $\Delta = 0.5$, $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$. Black points indicate point estimates; blue lines (wider intervals) represent $(1-\alpha_1)$% confidence intervals; and orange lines (shorter intervals) represent $(1-2\alpha_2)$% confidence intervals. Right panel - Rejection regions. Let $t_1 = t^*_{\alpha_1/2}$, $t_2 = t^*_{\alpha_2}$. The values of $\hat{\mu}_d = \bar{x}_1 - \bar{x}_2$ (on the x-axis) and $s^* = s_p(1/n_1+1/n_2)^{1/2}$ (on the y-axis) are obtained from the data. The three conclusions, positive, negative, and inconclusive, correspond to the three areas shaded in green, blue and red respectively. The black lettered points correspond to the thirteen scenarios on the left panel. R-code to reproduce: http://tinyurl.com/y68r2y7u

Figure 2.2: Comparing NHST p-values with $p_{CET}$ values across a wide range of effect sizes for various sample sizes. Distribution of two-tailed p-values from NHST and $p_{CET}$ values from CET, with varying n (total sample size) and r = [−0.5, 0.5]. All testing is based on the assumption of equal variance. Data are the results from 10,000 Monte Carlo simulations of two-sample normally distributed data with equal variance, $\sigma^2 = 1$. The true difference in means, $\mu_d = \mu_1 - \mu_2$, is drawn from a uniform distribution between -2 and 2.
Black points ($= 1 - p_2$) might indicate evidence in favour of equivalence (“negative result”), whereas blue points ($= p_1$) indicate evidence in favour of a difference (“positive result”). An “inconclusive result” would occur for all black points falling in between the two dashed horizontal lines at $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$ (i.e., for $p_2 \geq \alpha_2$). The format of this plot is based on Figure 2.1, “Distribution of two-tailed p-values from Student’s t-test,” from Lew (2013). R-code to reproduce: http://tinyurl.com/y5aouva3

[Figure 2.3 plot: x-axis $\mu_d$ (0.10 to 0.35); y-axis n (0 to 4,000); legend: variance ($\sigma^2$) of 0.75, 1.00 and 1.25, each shown for Power and for Pr(Success); labelled points a1, a2, b1, b2.]

Figure 2.3: The required sample size for 90% power, or Pr(Success) = 61%. Suppose the anticipated difference in population means is $\tilde{\mu}_d = 0.205$, the anticipated variance is $\tilde{\sigma}^2 = 1$ and the equivalence margin is pre-specified with $\Delta = 0.1025$ ($= \frac{1}{2}\tilde{\mu}_d$). Then, based on the desire for Pr(positive) = 90% (i.e., “power” = 0.90) (dashed blue lines) or a 61% “probability of success” (solid dark red lines), a sample size of n = 1,000 (with $n_1 = n_2$) would be required.
If the true variance is slightly larger than anticipated, $\sigma^2 = 1.25$, and the mean difference smaller, $\mu_d = 0.123$, the actual sample size needed for 90% power is in fact 3,476, while the actual sample size needed for Pr(success) = 61% is 1,770; see points “a1” and “a2.” On the other hand, if the true variance is slightly smaller than anticipated, $\sigma^2 = 0.75$, and the mean difference greater, $\mu_d = 0.33$, the actual sample size needed for 90% power is only 288, while the actual sample size needed for Pr(success|r,d,s) = 61% is 638; see points “b1” and “b2.” R-code to reproduce: http://tinyurl.com/y3np9cav

[Figure 2.4 plot: panel A, JZS Bayes Factor ($B_{01}$, with reference lines at 1/10, 1/3, 3:1 and 10:1); panel B, Pr($H_0$|data) $= B_{01}/(1+B_{01})$; panel C, $p_{CET}$ with $\Delta = 0.50$; each plotted against n (10 to 10,000, log scale) for $\bar{x}_1 - \bar{x}_2$ = 0, 0.09, 0.25 and 0.67.]

Figure 2.4: A comparison between the JZS-Bayes Factor and the $p_{CET}$. Based partially on Figure 5 from Rouder et al. (2009). Left (middle; right) panel shows how the JZS-Bayes Factor (the posterior probability of $H_0$; $p_{CET}$) changes with sample size for four different observed mean differences, $\hat{\mu}_d = \bar{x}_1 - \bar{x}_2$. The observed variance is constant, $s_p^2 = 1$. R-code to reproduce: http://tinyurl.com/yyubjpj8

[Figure 2.5 plot: nine panels, one per $\mu_d$ = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, 0.67; x-axis n (10 to 5,000); y-axis probability (0 to 1); curves for BF and CET testing, each with positive, negative and inconclusive conclusions.]

Figure 2.5: The probability of obtaining each conclusion by Bayesian testing scheme (JZS-BF with fixed sample size design, BF threshold of 3:1) and CET ($\alpha_1 = 0.05$, $\alpha_2 = 0.10$). Each panel displays the results of simulations with true mean difference, $\mu_d$ = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, and 0.67.
R-code to reproduce: https://tinyurl.com/y3sppwck

[Figure 2.6 plot: nine panels, one per $\mu_d$ = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, 0.67; x-axis n (10 to 5,000); y-axis probability (0 to 1); agreement curves: overall, positive, negative.]

Figure 2.6: Simulation study results: levels of agreement. The black line (“overall”) indicates the probability that the conclusion reached by CET (with $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$) is in agreement with the conclusion reached by the JZS-BF Bayesian testing scheme (JZS-BF with fixed sample size design, BF threshold of 3:1). The blue line (“positive”) indicates the probability that both methods agree with respect to whether or not a result is positive. The crimson line (“negative”) indicates the probability that both methods agree with respect to whether or not a result is negative. Each panel displays the results of simulations with true mean difference, $\mu_d$ = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, and 0.67. R-code to reproduce: https://tinyurl.com/y3sppwck

[Figure 2.7 plot: nine panels, one per $\mu_d$ = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, 0.67; x-axis n (10 to 5,000); y-axis probability (0 to 1); curves for BF and CET testing, each with positive, negative and inconclusive conclusions.]

Figure 2.7: Simulation study results: conclusions. The probability of obtaining each conclusion by Bayesian testing scheme (JZS-BF with fixed sample size design, BF threshold of 6:1) and CET ($\alpha_1 = 0.05$, $\alpha_2 = 0.10$). Each panel displays the results of simulations with true mean difference, $\mu_d$ = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, and 0.67.
R-code to reproduce: https://tinyurl.com/y3sppwck

[Figure 2.8 plot: nine panels, one per $\mu_d$ = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, 0.67; x-axis n (10 to 5,000); y-axis probability (0 to 1); agreement curves: overall, positive, negative.]

Figure 2.8: Simulation study results: levels of agreement. The black line (“overall”) indicates the probability that the conclusion reached by CET (with $\alpha_1 = 0.05$ and $\alpha_2 = 0.10$) is in agreement with the conclusion reached by the JZS-BF Bayesian testing scheme (JZS-BF with fixed sample size design, BF threshold of 6:1). The blue line (“positive”) indicates the probability that both methods agree with respect to whether or not a result is positive. The crimson line (“negative”) indicates the probability that both methods agree with respect to whether or not a result is negative. Each panel displays the results of simulations with true mean difference, $\mu_d$ = 0, 0.07, 0.09, 0.13, 0.18, 0.25, 0.35, 0.48, and 0.67. R-code to reproduce: https://tinyurl.com/y3sppwck

Figure 2.9: Left panel - The RR publication policy. The RR policy has two central components: (1) pre-registration and (2) the “RR commitment to publish.” Right panel - The CET publication policy. Note that the CET policy will not require a sample size calculation showing a specific level of statistical power. Instead, the researcher will define an equivalence margin for each hypothesis test to be carried out. A study will only be published if a conclusive result is obtained. Otherwise, the journal will maintain a record of the inconclusive result.
PowerPoint file to reproduce: http://tinyurl.com/y6t9qxxb

Chapter 3

The world of research has gone berserk: modeling the consequences of requiring “greater statistical stringency” for scientific publication

As noted in Chapter 1, many researchers have recently proposed adopting measures of “greater statistical stringency,” including suggestions to require larger sample sizes and to lower the highly criticized “p < 0.05” significance threshold. In statistical terms, this represents selecting lower levels for acceptable type I and type II error rates. The main argument against “greater statistical stringency” is that these changes would increase the costs of a study and ultimately reduce the number of studies conducted.

Consider the debate about lowering the significance threshold in response to the work of Johnson (2013), who, based on the correspondence between uniformly most powerful Bayesian tests and classical significance tests, recommends lowering significance thresholds by a factor of 10 (e.g., from p < 0.05 to p < 0.005). Gaudart et al. (2014), voicing a common objection, contend that such a reduction in the allowable type I error rate will result in inevitable increases to the type II error rate. While larger sample sizes could compensate, this can be costly: “increasing the size of clinical trials will reduce their feasibility and increase their duration” (Gaudart et al., 2014). Johnson (2014) suggests that this may not necessarily be such a bad thing, pointing to the excess of false positives and the idea that (in the context of clinical trials) “too many ineffective drugs are subjected to phase III testing [...] wast[ing] enormous human and financial resources.”

More recently, a highly publicized call by over seven dozen authors to “redefine statistical significance” has made a similar suggestion: lower the threshold of what is considered “significant” for claims of new discoveries from p < 0.05 to p < 0.005 (Benjamin et al., 2018).
This has prompted many familiar responses (e.g., Wei and Chen (2018), who prefer study replication over a lowered significance threshold for reducing the risk of false-positive results, and Hernandez et al. (2018), who are concerned about increased costs). In a recent article, Amrhein et al. (2017) review the arguments for and against more stringent thresholds for significance and conclude that: “[v]ery possibly, more stringent thresholds would lead to even more results being left unpublished, enhancing publication bias. [...] [W]hile aiming at making our published claims more reliable, requesting more stringent fixed thresholds would achieve quite the opposite.”

There is also substantial disagreement about suggestions to require larger sample sizes. In some fields, showing that a study has a sufficient sample size (i.e., high power) is common practice and an expected requirement for funding and/or publication, while in others it rarely occurs. For example, Charles et al. (2009) found that 95% of randomized controlled trials (RCTs) report sample size calculations. In contrast, only a tiny fraction of articles in some fields (about 3% for psychology, and about 2% for toxicology) justify their sample size (Bosker et al., 2013, Fritz et al., 2013). The number is only marginally higher in conservation biology at approximately 8% (Fidler et al., 2006).

As discussed in the Introduction (Chapter 1), one argument against strict sample size requirements (SSRs) is that, once a significant finding is achieved, the size of a study is no longer relevant; see Aycaguer and Galbán (2013), Bacchetti (2002). Another viewpoint is that, while far from ideal, underpowered studies should be published since, cumulatively, they can contribute to useful findings (Walker, 1995). Others disagree and contend that small sample sizes undermine the reliability of published science (Button et al., 2013a, Dumas-Mallet et al., 2017, Nord et al., 2017). In the context of clinical trials, IntHout et al.
(2016) review the many conflicting opinions about whether trials with suboptimal power are justified and conclude that, in circumstances when evidence for efficacy can be effectively combined across a series of trials (e.g., via meta-analysis), small sample sizes might be justified.

Despite the long-running and ongoing debates on significance thresholds and sample size requirements, there has been little to no modeling of how changes to a publication policy might affect what type of studies are pursued, the incentive structures driving research, and ultimately, the reliability of published science. One example is Borm et al. (2009), who conclude, based on simulation studies, that the consequences of publication bias do not warrant the exclusion of trials with low power. Another recent example is Higginson and Munafò (2016), who, based on results from a model of the “scientific ecosystem,” conclude that in order to “improve the scientific value of research” peer-reviewed publications should indeed require larger sample sizes, lower the $\alpha$ significance threshold, and give “less weight to strikingly novel findings.” Our work in this chapter aims to build upon these modeling efforts to better inform the ongoing discussion on the reproducibility of published science.

The model and methodology we present seek to add three features absent from the model of Higginson and Munafò (2016). While these authors consider how researchers balance available resources between exploratory and confirmatory studies, this simple dichotomy does not allow for a detailed assessment of the willingness of researchers to pursue high-risk studies (those studies that are, a priori, unlikely to result in a statistically significant finding).
Our approach addresses this issue by considering a continuous spectrum of a priori risk, i.e., the “pre-study probability.” Secondly, Higginson and Munafò (2016) define the “total fitness of a researcher” (i.e., the payoff for a given research strategy) with diminishing returns for confirmatory studies, but not for exploratory studies. This choice, however well intended, has problematic repercussions for their model. (Under their framework, the optimal research strategy will depend on T, an arbitrary total budget parameter.) Finally, by failing to incorporate the number of correct studies that go unpublished within their metric for the value of scientific research, many potential downsides of adopting measures to increase statistical stringency are ignored. Other differences between our approach and previous ones will be made evident and include: considering outcomes in terms of distributional differences, and specific modeling of how sample size requirements are implemented.

The remainder of this chapter is structured as follows. In Section 3.1, we describe the model and methodology proposed to evaluate different publication policies. We also list a number of metrics of interest and look to recent meta-research analyses to calibrate our model. In Section 3.2, we use the proposed methodology to evaluate the potential impact of lowering the significance threshold. In Section 3.3, we use the proposed methodology to evaluate the potential impact of requiring larger sample sizes. Finally, in Section 3.4, we conclude with suggestions as to how publication policies can be defined to best improve the reliability of published research.

3.1 Methods

Recently, economic models have been rather useful for evaluating proposed research reforms (Gall et al., 2017). However, modeling of how resources ought to be allocated among research projects is not new. See for example the work of Greenwald (1975), Dasgupta and Maskin (1987), McLaughlin (2011), and Miller and Ulrich (2016).
As noted in the introduction, our framework for modeling the scientific ecosystem is closest in spirit to that of Higginson and Munafò (2016), who formulate a relationship between a researcher’s strategy and their payoff, with the strategy involving a choice of mix between exploratory and confirmatory studies, and a choice of pursuing fewer studies with larger samples or more studies with smaller samples.

The publication process is complex and includes both objective and subjective considerations of efficacy and relevance. The title of this chapter was chosen specifically to emphasize this point (Gornitzki et al., 2015). A large, complicated human process like that of scientific publication cannot be entirely reduced to metrics and numbers: there are often financial, political and even cultural reasons for a paper being accepted or rejected for publication. With this in mind, the model presented here should not be seen as an attempt to precisely map out the peer-review process, but rather as a useful tool for determining the consequences of implementing different publication policies. We also wish to emphasize that, while we use the binary language of “true” and “false” (e.g., “true positive” and “false negative”), scientific studies will never prove a theory to be true or false (Popper, 1959). Rather, studies can only show hypotheses to be either consistent or inconsistent with the observed data.

Within our model, many assumptions and simplifications are made. Most importantly, we assume that each researcher must make decisions consisting of only two choices: what statistical power (i.e., sample size) to adopt and what “pre-study probability” to pursue. Before elaborating further, let us briefly discuss these two concepts.

3.1.1 Statistical power

Increasing statistical power by conducting larger studies would undeniably result in more published research being true. However, these improvements may only prove modest, given current publication guidelines.
When we consider the perspectives of both researchers and journal editors, it is not surprising that statistical power has not improved substantially (Lamberink et al., 2018, Smaldino and McElreath, 2016) despite being highlighted as an issue over five decades ago (Cohen, 1962).

The statistical power of a study depends greatly on the study’s sample size. However, it also depends on a variety of other factors, including measurement quality and study design. In practice, multiple factors can influence sample size determination (Hazra and Gogtay, 2016, Lenth, 2001, Roos, 2017). Strictly in terms of publication prospects, however, there is often little incentive to conduct high-powered studies: basic logic suggests that the likelihood of publication is only minimally affected by power. To illustrate, consider a large number of hypotheses tested, out of which 10% are truly non-null. Under the assumption that only (and all) positive results are published (which may in fact be realistic in certain fields; see Fanelli (2011)), a simple analytical calculation (Ioannidis, 2005) shows that increasing average power from an “unacceptably low” 55% to a “respectable” 85% (at the cost of more than doubling sample size) results in only a minimal increase in the likelihood of publication: from 10% to 12.5%. Moreover, the proportion of true findings amongst those published is only increased modestly: from 55% to 64%. Indeed, a main finding of Higginson and Munafò (2016) is that the rational strategy of a researcher is to “carry out lots of underpowered small studies to maximize their number of publications, even though this means around half will be false positives.” This result is in line with the views of many; see for example Bakker et al. (2012), Button et al. (2013b) and Gervais et al. (2015).

From a journal’s perspective, there is also little incentive to make larger sample sizes a requirement for publication. There are minimal consequences from publishing false claims.
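The publication-likelihood arithmetic quoted above (10% of hypotheses non-null, only positive results published) can be checked with a few lines of code. The sketch below is a minimal illustration, assuming the simple relation in the spirit of Ioannidis (2005): the probability of a positive result is psp · pwr + (1 − psp) · α, and the proportion of positives that are true is psp · pwr divided by that sum. The function names are hypothetical, and α = 0.05 is an assumption, since the passage does not state the threshold.

```python
def prob_positive(psp, pwr, alpha=0.05):
    """Probability that a randomly chosen study yields a positive result.

    psp   : share of tested hypotheses that are truly non-null
    pwr   : statistical power for the non-null hypotheses
    alpha : significance threshold (assumed 0.05 here)
    """
    return psp * pwr + (1.0 - psp) * alpha


def prop_true_among_positive(psp, pwr, alpha=0.05):
    """Proportion of positive results that are true positives."""
    return psp * pwr / prob_positive(psp, pwr, alpha)


if __name__ == "__main__":
    # Compare a few power levels with 10% of hypotheses non-null.
    for pwr in (0.55, 0.80, 0.85):
        print(pwr,
              round(prob_positive(0.10, pwr), 3),
              round(prop_true_among_positive(0.10, pwr), 3))
```

Under these assumptions, a power of 0.55 gives a 10% chance of a positive result, of which 55% are true. A power of 0.85 gives 13% and roughly 65%, while the rounded figures of 12.5% and 64% correspond to a power of 0.80 under this formula. Either way, the gain from the extra power is modest.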
Fraley and Vazire (2014) review the publication history of six major journals in social-personality psychology and find that “journals that have the highest impact [factor] also tend to publish studies that have smaller samples.” This finding is in agreement with Szucs and Ioannidis (2017b), who conclude that, in the fields of cognitive neuroscience and psychology, journal impact factors are negatively correlated with statistical power; see also Brembs et al. (2013).

3.1.2 Pre-study probability

We use the term “pre-study probability” (psp) as shorthand for the a priori probability that a study’s null hypothesis is false. In this sense, highly exploratory research will typically have very low psp, whereas confirmatory studies will have a relatively high psp.

Studies with low psp are not problematic per se. To the contrary, there are undeniable benefits to pursuing “long-shot” novel ideas that are very unlikely to work out; see Cohen (2017). While replication studies (i.e., studies with higher psp) may be useful to a certain extent, there is little benefit in confirming a result that is already widely accepted. As Button et al. (2013a) note: “As R [the pre-study odds] increases [...] the incremental value of further research decreases.” Most scientific journals no doubt take this into account in deciding what to publish, with more surprising results more likely to be published. This state of affairs persists, despite recent calls for more replication studies (e.g., Moonesinghe et al. (2007)). Indeed, replication studies are still often rejected on the grounds of “lack of novelty” (Makel et al., 2012, Martin and Clarke, 2017, Yeung, 2017). As such, researchers deciding which hypotheses to pursue towards publication will likely emphasize those with lower psp.

For fixed resources, T, and a given (psp, pwr),
the expected number of...        Equation
True Positives published         TP_PUB(psp, pwr) = psp · n_S · A · Pr(TP)
True Positives unpublished       TP_UN(psp, pwr)  = psp · n_S · (1 − A) · Pr(TP)
False Negatives published        FN_PUB(psp, pwr) = psp · n_S · B · Pr(FN)
False Negatives unpublished      FN_UN(psp, pwr)  = psp · n_S · (1 − B) · Pr(FN)
False Positives published        FP_PUB(psp, pwr) = (1 − psp) · n_S · A · Pr(FP)
False Positives unpublished      FP_UN(psp, pwr)  = (1 − psp) · n_S · (1 − A) · Pr(FP)
True Negatives published         TN_PUB(psp, pwr) = (1 − psp) · n_S · B · Pr(TN)
True Negatives unpublished       TN_UN(psp, pwr)  = (1 − psp) · n_S · (1 − B) · Pr(TN)
Publications                     ENP(psp, pwr)    = TP_PUB + FN_PUB + FP_PUB + TN_PUB

Table 3.1: Equations for the expected number of studies (out of a total of n_S studies) for each of the eight categories, with A = probability of publication for a positive result and B = probability of publication for a negative result. The number of studies, n_S, changes for different values of pwr. We have that n_S = [k + n(pwr)]^(−1) T, where n(pwr) is the required sample size to obtain a power of pwr.

Recognize that the lower the psp, the less likely a “statistically significant” finding is to be true. As such, we are bound to a “seemingly inescapable trade-off” (Fiedler, 2017) between the novel and the reliable. Journal editors face a difficult choice. Either publish studies that are surprising and exciting yet most likely false, or publish reliable studies which do little to advance our knowledge. Based on a
Based on abelief that both ends of this spectrum are equally valuable, Higginson and Munafo`(2016) conclude that, in order to increase reliability, current incentive structuresshould be redesigned, “giving less weight to strikingly novel findings.” This is inagreement with the view of Hagen (2016) who writes: “If we are truly concernedabout scientific reproducibility, then we need to reexamine the current emphasis onnovelty and its role in the scientific process.”433.1.3 Model frameworkWe describe our model framework in 5 simple steps.(1) We assume, for simplicity, that all studies test a null hypothesis of equalpopulation means against a two-sided alternative, with a standard two-sample Stu-dent t-test (with the assumption of equal variance). Each study has an equal numberof observations per sample (n1 = n2; n= n1+n2). Furthermore, let us assume thata study can have one of only two results: (1) positive (p-value< a), or (2) negative(p-value a). Given the true effect size, µd = µ2 µ1 (the difference in popula-tion means), the common population variance, s2, and the sample size, n, we caneasily calculate the probability of a significant result, using the standard formulafor power.1Then, for a given true effect size of d or zero, we have the probability of a TruePositive (True Postive (TP)), False Negative (False Negative (FN)), False Positive(False Positive (FP)), and True Negative (True Negative (TN)), equal to: Pr(TP) =Pr(Positive|µd = d ), Pr(FN)=Pr(Negative|µd = d ), Pr(FP)=Pr(Positive|µd =0), and Pr(TN) = Pr(Negative|µd = 0), respectively.(2) Next we consider a large number of studies, nS, each with a total samplesize of n. Of these nS studies, only a fraction, psp (where psp is the pre-studyprobability), have a true effect size of µd = d . The remaining (1 psp) · nS stud-ies, have µd = 0. Note that for a given sample size, n, these nS studies are each“powered” at level pwr= Pr(TP). 
Throughout this chapter, we focus on scenarios where σ² = 1 and δ = 0.20 or δ = 0.43. These choices reference the analyses of Lamberink et al. (2018) and Richard et al. (2003). Lamberink et al. (2018), based on an empirical analysis of over one hundred thousand clinical trials conducted between 1975 and 2014, estimate that the median effect size of clinical trials is approximately a Cohen’s d = 0.20 (this corresponds to a scenario where σ² = 1 and δ = 0.20). In a similar analysis based on data from over 25,000 social/personality studies, Richard et al. (2003) estimate that the mean effect size in social psychology research is a Cohen’s d = 0.43 (this corresponds to a scenario where σ² = 1 and δ = 0.43).

¹ The probability of a positive result is equal to:
\[
\Pr(\text{Positive} \mid \mu_d) = \left(1 - F_{n-2,\,\mu_d/\sigma^{*}}\!\left(t^{*}_{\alpha/2}\right)\right) + F_{n-2,\,\mu_d/\sigma^{*}}\!\left(-t^{*}_{\alpha/2}\right),
\]
where $t^{*}_{\alpha/2}$ is the upper $100\cdot\frac{\alpha}{2}$-th percentile of the t-distribution with $n-2$ degrees of freedom, $\sigma^{*} = \sigma\{(1/n_1 + 1/n_2)\}^{1/2}$, and $F_{df,ncp}(x)$ is the cdf of the non-central t distribution with $df$ degrees of freedom and non-centrality parameter $ncp$.

(3) We also label each study as either published (PUB) or unpublished (UN) for a total of 8 distinct categories (= 2 (positive, negative) × 2 (true and false) × 2 (published and unpublished)). One can easily determine the expected number of studies (out of a total of n_S studies) in each category. Table 3.1 lists the equations for each of the eight categories with A equal to the probability of publication for a positive result, and B equal to the probability of publication for a negative result.

Throughout this chapter, we will always assume that only positive studies are published; hence, we set B = 0. Initially, we will let the probability of publication of a positive study, A, depend on the psp according to the function A = (1 − psp)^m, where m ∈ (0, ∞) is a tuning parameter. This simple, decreasing function of psp represents a system in which positive results with lower psp are more likely to be published on the basis of novelty.
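The bookkeeping in Table 3.1 can be sketched as follows. This is an illustrative implementation only: the function name and the dictionary keys are hypothetical, Pr(TP) is taken to equal pwr and Pr(FP) to equal α (as in steps (1) and (2)), and the number of studies n_S is passed in directly rather than derived from a budget.

```python
def expected_counts(psp, pwr, n_studies, alpha=0.05, m=3.0, B=0.0):
    """Expected number of studies in each of the eight categories.

    A = (1 - psp) ** m is the publication probability for a positive
    result; B is the publication probability for a negative result.
    """
    A = (1.0 - psp) ** m
    pr_tp, pr_fn = pwr, 1.0 - pwr        # outcomes when the effect is real
    pr_fp, pr_tn = alpha, 1.0 - alpha    # outcomes when the null is true
    non_null = psp * n_studies
    null = (1.0 - psp) * n_studies

    counts = {
        "TP_PUB": non_null * A * pr_tp,
        "TP_UN": non_null * (1.0 - A) * pr_tp,
        "FN_PUB": non_null * B * pr_fn,
        "FN_UN": non_null * (1.0 - B) * pr_fn,
        "FP_PUB": null * A * pr_fp,
        "FP_UN": null * (1.0 - A) * pr_fp,
        "TN_PUB": null * B * pr_tn,
        "TN_UN": null * (1.0 - B) * pr_tn,
    }
    # Expected number of publications: the four published categories.
    counts["ENP"] = (counts["TP_PUB"] + counts["FN_PUB"]
                     + counts["FP_PUB"] + counts["TN_PUB"])
    return counts
```

For instance, `expected_counts(psp=0.2, pwr=0.2, n_studies=1000)` uses A = 0.8³ = 0.512, so roughly 20.5 true positives and 20.5 false positives are published out of 1,000 attempted studies, and the eight categories always sum to n_S.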
A higher value of m indicates that higher psp studies (i.e., low-risk hypotheses) are less likely to be published. Consequently, a higher value of m implies a lower overall publication rate. In Section 3.3, we will define A differently, such that the probability of publication depends on the values of both psp and pwr.

(4) We determine the total number of studies, n_S, based on three parameters: T, the total resources available (in units of observations sampled); k, the fixed cost per study (also expressed in equivalent units of observations sampled); and n, the total sample size per study. Consequently, as in Higginson and Munafò (2016), n_S = [k + n(pwr)]^(−1) T, where n(pwr) is the required sample size to obtain a power of pwr. Then for any given level of power, pwr, we can easily obtain the necessary sample size per study, n (using established sample size formulas for two-sample t-tests), and a resulting total number of studies, n_S. Keep in mind that n_S, the number of studies, is a function of pwr. When necessary, we take T = 100,000. However, note that when comparing the outcomes of different publication policies, this choice is entirely irrelevant. Consider that when k is small, the cost of data, relative to the total cost of a study, is large. Conversely, when k is large, the relative cost of increasing the study sample size is small. For example, suppose k = 20 (with δ = 0.20 and α = 0.05); then increasing the power from 0.55 to 0.85 implies doubling the total cost of the study. If instead k = 500 (all else being equal), this increase in power corresponds to only a 50% increase in the total study cost.²

(5) Finally, let us define a “research strategy” to be a given pair of values for (psp, pwr) within ([0, 1] × [α, 1]). Then, for a given research strategy we can easily calculate the total Expected Number of Publications (ENP):
\[
\mathrm{ENP}(psp, pwr) = \mathrm{TP}_{\mathrm{PUB}} + \mathrm{FP}_{\mathrm{PUB}} + \mathrm{TN}_{\mathrm{PUB}} + \mathrm{FN}_{\mathrm{PUB}}. \tag{3.1}
\]

Figure 3.1 illustrates how ENP is calculated for different values of psp and pwr, with fixed α = 0.05, k = 100 and m = 3. With this set-up in hand, suppose now a researcher pursues (consciously or unconsciously) strategies that maximize the expected number of publications (Charlton and Andras, 2006). This may not be an entirely unreasonable assumption considering the influence of Goodhart’s Law (“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes”) on academia; see Fire and Guestrin (2018) and van Dijk et al. (2014).

Figure 3.2 shows how ENP changes over a range of values of psp and pwr and under different fixed values for k and m. (The central panel of Figure 3.2 (k = 100, m = 3) corresponds to Figure 3.1.) Depending on k and m, the value of (psp, pwr) that maximizes ENP can change substantially. With larger k, higher-powered strategies will yield a greater ENP; with smaller m, optimal strategies are those with higher psp. It is interesting to observe how the optimal strategy changes under different scenarios. However, it may be more informative to consider a distribution of preferred strategies. This may also be a more realistic approach. While rational researchers may be drawn toward optimal strategies, surely scientists are not willing or able to precisely identify these.

² R-code to make this calculation:
library(pwr)
(pwr.t.test(d=0.20, power=0.85, type="two.sample")$n*2+20)/(pwr.t.test(d=0.20, power=0.55, type="two.sample")$n*2+20)
(pwr.t.test(d=0.20, power=0.85, type="two.sample")$n*2+500)/(pwr.t.test(d=0.20, power=0.55, type="two.sample")$n*2+500)

Taking the incentive to publish as a starting point, we assume scientists are more likely to spend resources on studies whose psp and pwr produce a higher expected number of publications. So we consider a distribution of resource allocation across study characteristics whose density is proportional to the expected number of publications, i.e., f_RES(psp, pwr) ∝ ENP(psp, pwr).
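To make steps (4) and (5) concrete, the sketch below computes n(pwr), n_S and ENP under stated assumptions. The total sample size uses the usual normal-approximation formula for a two-sided two-sample comparison, n = 4(z_{1−α/2} + z_{pwr})²σ²/δ², which is slightly smaller than the exact t-test sample size. It also takes B = 0 and A = (1 − psp)^m, as in step (3). All function names are hypothetical.

```python
from statistics import NormalDist

_Z = NormalDist().inv_cdf  # standard normal quantile function


def total_sample_size(pwr, delta=0.20, sigma=1.0, alpha=0.05):
    """Approximate total n (both groups) needed for power pwr.

    Normal approximation; intended for alpha < pwr < 1.
    """
    n_per_group = 2.0 * ((_Z(1.0 - alpha / 2.0) + _Z(pwr)) * sigma / delta) ** 2
    return 2.0 * n_per_group


def number_of_studies(pwr, k, T=100_000.0, **design):
    """n_S = T / (k + n(pwr)): studies affordable with resources T."""
    return T / (k + total_sample_size(pwr, **design))


def enp(psp, pwr, k=100.0, m=3.0, T=100_000.0, alpha=0.05, delta=0.20):
    """Expected number of publications with B = 0 and A = (1 - psp)^m."""
    n_s = number_of_studies(pwr, k, T=T, delta=delta, alpha=alpha)
    publish_prob = (1.0 - psp) ** m
    # Positive results arise from true effects (pwr) or by chance (alpha).
    return n_s * publish_prob * (psp * pwr + (1.0 - psp) * alpha)


def cost_ratio(k, pwr_low=0.55, pwr_high=0.85, **design):
    """Relative total study cost when moving from pwr_low to pwr_high."""
    return ((k + total_sample_size(pwr_high, **design))
            / (k + total_sample_size(pwr_low, **design)))
```

Under this approximation, `cost_ratio(20)` is close to 2 and `cost_ratio(500)` is close to 1.5, in line with the doubling and the 50% increase described in step (4). Ratios of ENP values between two strategies do not depend on T, which is why the choice T = 100,000 is immaterial for comparisons.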
We emphasize that from a funder’s viewpoint, f_RES() describes where dollars are being spent on the (psp, pwr) plane. For example, consider the nine strategies illustrated in Figure 3.1. Given the relative ENP values, we expect the expenditure on (psp, pwr) = (0.2, 0.2) studies to be 18 times greater than that on (psp, pwr) = (0.8, 0.5) studies.

In the Chapter 3 Appendix, we give expressions for two distributions that follow on from f_RES(). The first is f_ATM(psp, pwr), which describes where attempted studies (not resources) fall on the (psp, pwr) plane, presuming that resources are allocated according to f_RES(). Since higher-powered studies are more costly, f_RES() and f_ATM() are not the same. Hence f_ATM() describes what kinds of studies are attempted more frequently.

Of course not all attempted studies are published. Thus in the Chapter 3 Appendix we also give an expression for f_PUB(psp, pwr), the density of (psp, pwr) amongst published studies that is implied when f_ATM() describes the characteristics of attempted studies. Note then that each of f_RES(), f_ATM(), and f_PUB() describes relatively favoured and disfavoured (psp, pwr) combinations. However, the distinction between the three is important. While f_RES() describes resource deployment, f_ATM() describes the resulting constellation of attempted studies, and f_PUB() describes the resulting constellation of published studies.

Armed with f_RES(), f_ATM(), and f_PUB(), we can investigate the properties of a given scientific ecosystem, and how these properties vary across ecosystems. Specifically, an ecosystem is specified by choices of α, k, m, A and B. For any specification, properties of the three distributions are readily computed via two-dimensional numerical integration using a fine (200 by 200) grid of (psp, pwr) values. We do not require any simulation to obtain our results (except for those results relating to the accuracy of published research; see details in the Chapter 3 Appendix).
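The grid computation can be illustrated with a short sketch. It discretizes (psp, pwr) on a regular grid, evaluates ENP at each cell midpoint, and normalizes so that the cells sum to one, giving a discrete stand-in for f_RES(). The expressions for f_ATM() and f_PUB() are given in the Chapter 3 Appendix and are not reproduced here. The midpoint grid, the normal-approximation sample size, the default parameter values and all function names are assumptions made for illustration.

```python
from statistics import NormalDist

_Z = NormalDist().inv_cdf  # standard normal quantile function


def enp(psp, pwr, k, m, delta, alpha, T=100_000.0, sigma=1.0):
    """ENP with B = 0, A = (1 - psp)^m and an approximate n(pwr)."""
    n_total = 4.0 * ((_Z(1.0 - alpha / 2.0) + _Z(pwr)) * sigma / delta) ** 2
    n_studies = T / (k + n_total)
    publish_prob = (1.0 - psp) ** m
    return n_studies * publish_prob * (psp * pwr + (1.0 - psp) * alpha)


def resource_density(k=100.0, m=3.0, delta=0.20, alpha=0.05, size=200):
    """Discrete approximation to f_RES on a size-by-size midpoint grid.

    psp ranges over (0, 1) and pwr over (alpha, 1); midpoints avoid
    the endpoints, where the quantile function is undefined.
    """
    psp_vals = [(i + 0.5) / size for i in range(size)]
    pwr_vals = [alpha + (1.0 - alpha) * (j + 0.5) / size for j in range(size)]
    weights = [[enp(p, w, k, m, delta, alpha) for w in pwr_vals]
               for p in psp_vals]
    total = sum(sum(row) for row in weights)
    density = [[v / total for v in row] for row in weights]
    return psp_vals, pwr_vals, density


def mean_psp(psp_vals, density):
    """Mean pre-study probability under the discretized density."""
    return sum(p * sum(row) for p, row in zip(psp_vals, density))
```

As a quick check on behaviour, the mean psp under this discretized density is lower for m = 9 than for m = 1, consistent with the observation that a stronger emphasis on novelty draws resources toward lower psp.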
To illustrate, consider the much coarser 3 by 3 grid in Figure 3.1 where ENP could be easily calculated for each value of (psp, pwr).

3.1.4 Ecosystem metrics

We will evaluate the merits of different publication policies on the basis of the following six ecosystem metrics of interest. In the Chapter 3 Appendix, we provide further details on these metrics and introduce some compact notation that will be useful for their calculation from f_RES() and its by-products.

1. The Publication Rate (PubR). The ratio of the number of published studies over the number of attempted studies is of evident interest.

2. The Reliability (REL). The proportion of published findings that are correct is a highly relevant metric for the scientific ecosystem. Ideally, we wish to see a literature with as few false positives as possible.

3. The Breakthrough Discoveries (DSCV). The ability of the scientific ecosystem to produce breakthrough findings is an important attribute. Here, a true breakthrough result is defined as a true positive and published study that results from a psp value below a threshold of 0.05. Note that we will calculate the total number of true breakthrough discoveries, DSCV, for each ecosystem for T units of resources. We will also note the reliability of breakthrough discoveries (DREL), the proportion of positive and published studies with psp < 0.05 that are true.

4. The Balance Between Exploration and Confirmation (IQpsp_PUB). The interquartile range of the distribution of psp values amongst published studies provides an assessment of the much discussed balance between exploratory and confirmatory research; see Sakaluk (2016) and Kimmelman et al. (2014).

5. The Median Power of Published Studies (Mpwr_PUB). As mentioned earlier, higher powered research leads to fewer yet more reliable publications.

6. The Accuracy of Published Research.
We will report the mean absolute estimated effect size in published studies, |µ̂_d|_PUB, as well as the proportion of published studies for which the estimated effect size is negative (µ̂_d^neg_PUB). It is well known that, due to a combination of publication bias and low-powered studies, published effect sizes will be inflated; see Ioannidis (2008) and Lane and Dunlap (1978). In addition, we will report the exaggeration ratio (the expected Type M error), and the Type S error rate (the proportion of published studies with an error in sign); see Gelman and Carlin (2014).

3.1.5 Model calibration

Let us begin by assessing whether or not our basic model framework is well calibrated. Are our modeling choices (specifically the choice of values to consider for parameters k and m) at all consistent with what is known about real conditions? In order to answer this important question, we considered a large number of different scenarios in which α is held fixed at 0.05, and both the fixed-cost parameter, k, and the novelty parameter, m, take various values; see Table 3.2. To better understand how estimates for the metrics of interest are obtained, see detailed derivations in the Chapter 3 Appendix.

There is a recognized lack of empirical research measuring the relationship between experimental costs and sample sizes (Roos, 2017), and it is even more difficult to empirically assess appropriate values for m. However, we were able to choose suitable values for m (= 1, 3, and 9) and k (= 20, 100, and 500) based on how our model outputs compared with empirical estimates in the meta-research literature. Specifically, values were chosen to exhibit a sufficiently wide range and so that (1) the reliability, (2) the power, and (3) the publication rate agreed, at least to a certain extent, with what has been observed empirically. Consider the following.

1.
Due to the success of several recent large-scale reproducibility projects, we are able to obtain reasonable approximations for the reliability of published research in different scientific fields. For example, Begley and Ioannidis (2015) estimate that the reliability (i.e., true-positive rate) of social science experiments published in the journals Nature and Science is approximately 67%. Based on the large-scale reproducibility project of Open Science Collaboration (2015), Wilson and Wixted (2018) estimate that the reliability of published studies in well-respected social-psychology journals is approximately 49%, and in cognitive psychology journals, approximately 81%. However, these estimates are at best only rough approximations since the studies replicated in the OSC 2015 were not a random sample of psychology studies; see Johnson et al. (2017) for a more in-depth analysis. For the chosen values of k and m, our model produces reliability measures ranging from 15% to 83%, for δ = 0.2, and from 25% to 86%, for δ = 0.43. Note that this wide range of values includes some reliability percentages that seem rather low (e.g., 15% and 25%).

2. There have also been a number of recent attempts to estimate the typical power of published research using meta-analytic effect sizes. For example, Button et al. (2013b) estimate that the median statistical power in published neuroscience is approximately 21% and Dumas-Mallet et al. (2017), using a similar methodology, conclude that the median statistical power in the biomedical sciences ranges from about 9% to 30% depending on the disease under study. Most recently, Lamberink et al. (2018) analyzed the literature of RCTs and determined that the median power is approximately 9% overall (and about 20% in the subset of RCTs included in significant meta-analyses). For the chosen values of k and m, our model produces a simulated published literature with median power ranging from 9% to 60%, for δ = 0.2, and from 21% to 71%, for δ = 0.43.

3.
While the publication rate numbers we obtain may appear rather low, ranging from 3% to 13%, consider that Siler et al. (2015), in a systematic review of manuscripts submitted to three leading medical journals, observed a publication rate of 6.2%. In a review of top psychology journals, Lee and Schunn (2011) found that acceptance rates ranged between 14% and 32%. It is important to recognize that these empirical estimates are calculated based on the subset of studies that are submitted for publication. If researchers are self-selecting their “most publishable” work for submission to journals, then these estimates will substantially overestimate the true publication rate of all studies. (Note that authors may also simply re-submit their work to another journal if initially rejected.) Song et al. (2014) estimate that about 85% of unpublished studies are unpublished for the simple reason that they are never submitted for publication. Our model also assumes that negative studies are never published (B = 0) and this (not entirely realistic) assumption also contributes to the low publication rate numbers we obtain. Senn (2012) provides substantial insights on both these related issues: the level of publication bias, and how researchers may choose to submit research based on the perceived probability of acceptance. Note that while the scenarios with higher publication rates may appear more plausible, these scenarios have very (perhaps unreasonably) high values for reliability and median power.

Regardless of the values chosen for k and m, the overall trends for our metrics of interest will be similar. As seen in Table 3.2, as k increases, the reliability of published research (REL) will increase, while the number of true breakthrough discoveries (DSCV) decreases. As m increases, reliability (REL) decreases while the number of true breakthrough discoveries (DSCV) increases. This strikes us as both reasonable and realistic.
When a greater emphasis is placed on novelty (i.e., when m is larger), there will be a greater number of smaller, high-risk studies published. While these publications are less reliable, they are more numerous and more likely to produce a breakthrough finding. When data is more affordable relative to overall study cost (i.e., when k is larger), there will be fewer, larger studies and as a result the published literature will be more reliable but less likely to produce a breakthrough finding.

Before moving on, let us also briefly consider how the accuracy of published effect sizes changes with k and m; see Table 3.3. Overall, we see that the average of the absolute published effect size, |µ̂_d|_PUB, is always much larger than δ. Furthermore, we see that a large proportion of published effect sizes are negative. This is to be expected when psp values are small and many published studies are false positives. For the subset of studies that are positive, published, and true, we see that the Type M and the Type S errors are greatest when k is small, m is large, and δ is small.
This is the same configuration that produces low-powered publications. Indeed, Type M and Type S errors are known to occur more frequently when power is low (Gelman and Carlin, 2014).

δ      k     m    PubR   REL    DSCV   DREL   Mpwr_PUB   IQpsp_PUB   [q25th, q75th]
0.20   20    1    0.06   0.62   0.17   0.07   0.20       0.35        [0.19, 0.54]
0.20   20    3    0.03   0.36   0.38   0.07   0.12       0.19        [0.07, 0.26]
0.20   20    9    0.03   0.15   0.86   0.06   0.09       0.07        [0.02, 0.09]
0.20   100   1    0.08   0.75   0.09   0.12   0.42       0.33        [0.25, 0.58]
0.20   100   3    0.04   0.52   0.22   0.12   0.32       0.20        [0.09, 0.29]
0.20   100   9    0.03   0.25   0.51   0.10   0.21       0.08        [0.03, 0.10]
0.20   500   1    0.11   0.83   0.04   0.18   0.60       0.31        [0.29, 0.60]
0.20   500   3    0.05   0.65   0.10   0.18   0.55       0.21        [0.12, 0.33]
0.20   500   9    0.03   0.36   0.26   0.16   0.44       0.09        [0.03, 0.12]
0.43   20    1    0.08   0.76   0.42   0.12   0.42       0.33        [0.25, 0.58]
0.43   20    3    0.04   0.53   1.00   0.12   0.32       0.20        [0.09, 0.29]
0.43   20    9    0.03   0.25   2.37   0.11   0.21       0.08        [0.03, 0.10]
0.43   100   1    0.11   0.83   0.21   0.18   0.59       0.31        [0.29, 0.60]
0.43   100   3    0.05   0.65   0.50   0.18   0.54       0.21        [0.12, 0.33]
0.43   100   9    0.03   0.36   1.24   0.16   0.43       0.08        [0.03, 0.11]
0.43   500   1    0.13   0.86   0.07   0.23   0.71       0.31        [0.30, 0.61]
0.43   500   3    0.06   0.71   0.17   0.22   0.68       0.21        [0.13, 0.34]
0.43   500   9    0.03   0.43   0.44   0.20   0.60       0.09        [0.03, 0.12]

Table 3.2: For ecosystems defined by fixed α = 0.05, A = (1 − psp)^m, and B = 0, and varying values of k, m, and δ, the table lists estimates for the metrics of interest. These include the publication rate (PubR), the reliability (REL), the number of true breakthrough discoveries (DSCV), the median power of published studies (Mpwr_PUB), and the interquartile range of the distribution of psp values amongst published studies (IQpsp_PUB), with corresponding 25th and 75th percentiles. Additionally, the table lists the reliability amongst the subset of published studies for which psp < 0.05 (DREL). R-code to reproduce: https://goo.gl/ZcszZz

3.2 The effects of adopting lower significance thresholds

In this section, we investigate the impact of adopting lower significance thresholds.
How might the published literature be different if the α = 0.05 threshold was lowered? For positive studies (i.e., studies for which the null is rejected), we will assume that the sample size of a study does not affect the likelihood of publication (at least not directly) and that studies with lower psp are more likely to be published according to the simple function introduced earlier, A = (1 − psp)^m. We compute the metrics of interest for 54 different ecosystems. Each ecosystem is uniquely defined with either δ = 0.2 or δ = 0.43, with one of three possible values for m (= 1, 3, 9), with one of three possible values for k (= 20, 100, 500), and most importantly, with one of three possible values for the α significance threshold (= 0.005, 0.020, 0.050).

δ      m    k     Mpwr_PUB   µ̂_d^neg_PUB   |µ̂_d|_PUB   Type M err.   Type S err.
0.20   1    20    0.20       0.21           0.60         2.42          0.03
0.20   3    20    0.12       0.31           0.72         2.70          0.04
0.20   9    20    0.09       0.43           0.80         2.77          0.04
0.20   1    100   0.42       0.14           0.37         1.68          0.01
0.20   3    100   0.32       0.23           0.45         1.81          0.01
0.20   9    100   0.21       0.37           0.51         1.92          0.01
0.20   1    500   0.60       0.09           0.29         1.40          0.00
0.20   3    500   0.55       0.17           0.31         1.40          0.00
0.20   9    500   0.44       0.33           0.35         1.48          0.00
0.43   1    20    0.42       0.14           0.75         1.63          0.01
0.43   3    20    0.32       0.21           0.80         1.65          0.00
0.43   9    20    0.21       0.39           0.90         1.74          0.01
0.43   1    100   0.59       0.09           0.60         1.36          0.00
0.43   3    100   0.54       0.18           0.64         1.43          0.00
0.43   9    100   0.43       0.32           0.69         1.47          0.00
0.43   1    500   0.71       0.08           0.54         1.29          0.00
0.43   3    500   0.68       0.17           0.56         1.29          0.00
0.43   9    500   0.60       0.28           0.57         1.30          0.00

Table 3.3: For ecosystems defined by fixed α = 0.05, A = (1 − psp)^m, and B = 0, and varying values of k, m, and δ, the table lists estimates for the metrics of interest related to the accuracy of the published effect sizes. These include the median power of published studies (Mpwr_PUB), the mean absolute published effect size (|µ̂_d|_PUB), the proportion of published effect sizes that are negative (µ̂_d^neg_PUB), the “Type S” error, and the “Type M” error. R-code to reproduce: https://goo.gl/aqM5qf

3.2.1 Results

In Table 3.4, we note how the various metrics change with α = 0.005 relative to α = 0.05, by reporting ratios. While we focus our discussion specifically on scenarios for which δ = 0.20, the results for δ = 0.43 are very similar. Figures 3.5 - 3.13 plot the complete results, including those for α = 0.020.

Ratio of ecosystem with α = 0.005 to ecosystem with α = 0.05:
δ      k     m    PubR   REL    DSCV   DREL   Mpwr_PUB   IQpsp_PUB
0.20   20    1    1.14   1.58   0.12   7.57   2.73       0.83
0.20   20    3    0.71   2.58   0.15   7.68   4.57       1.05
0.20   20    9    0.32   5.38   0.21   8.01   5.11       1.46
0.20   100   1    0.99   1.30   0.19   5.47   1.37       0.86
0.20   100   3    0.73   1.81   0.23   5.57   1.77       0.95
0.20   100   9    0.39   3.47   0.31   5.89   2.52       1.27
0.20   500   1    0.91   1.18   0.28   4.06   1.07       0.89
0.20   500   3    0.75   1.48   0.33   4.14   1.17       0.95
0.20   500   9    0.46   2.49   0.43   4.40   1.42       1.12
0.43   20    1    0.99   1.30   0.19   5.49   1.36       0.86
0.43   20    3    0.74   1.81   0.22   5.59   1.78       0.95
0.43   20    9    0.39   3.46   0.31   5.91   2.55       1.27
0.43   100   1    0.92   1.18   0.28   4.10   1.08       0.89
0.43   100   3    0.75   1.49   0.32   4.18   1.19       0.95
0.43   100   9    0.46   2.51   0.42   4.44   1.44       1.19
0.43   500   1    0.92   1.14   0.37   3.49   1.02       0.90
0.43   500   3    0.78   1.37   0.42   3.55   1.07       0.95
0.43   500   9    0.51   2.15   0.55   3.78   1.19       1.12

Table 3.4: A comparison between ecosystems with α = 0.05 and ecosystems with α = 0.005. For all ecosystems, we have A = (1 − psp)^m, B = 0, and varying values of k, m, and δ. The table lists estimates for the ratios (i.e., for a given metric: X_{α=0.005}/X_{α=0.05}) of the publication rate (PubR), the reliability (REL), the number of true breakthrough discoveries (DSCV), the median power of published studies (Mpwr_PUB), and the interquartile range of the distribution of psp values amongst published studies (IQpsp_PUB). R-code to reproduce: https://goo.gl/iKkDsH

Based on our results, we can make the following conclusions on the impact of adopting a lower, more stringent, significance threshold.

1. Reliability is substantially increased with a lower threshold. Based on our results, comparing α = 0.005 to α = 0.05, the impact on REL is greatest when δ is small, k is small, and m is large, see Table 3.4. This is due to the fact that with a lower significance threshold policy, attempted studies are typically of higher power (particularly so when k is small) and of higher pre-study probability. To understand why there would be more studies with higher power and higher pre-study probabilities, consider that with a lower α threshold, the probability of obtaining a significant result (i.e., p-value < α) “by chance” is substantially reduced. To maximize one’s expected number of publications (ENP) when α = 0.005, it is a much better strategy to pursue higher pwr and higher psp strategies. This is particularly true when the effect size, δ, is small.

2. A disadvantage of the lower threshold is that the number of breakthrough discoveries is substantially lower, see Table 3.4 and Figure 3.4. This is due to the fact that with a lower α threshold, fewer “high risk” (i.e., low psp) studies are attempted. The chance that the p-value will fall below α, when α is lowered to 0.005, is already risky enough. The DREL ratios we obtain suggest that, while fewer low psp studies will be published, these will be much more reliable.

3. When increasing study power is less costly relative to the total cost of a study (i.e., when k is larger, or when δ is larger), the benefit of lowering the significance threshold (increased REL) is somewhat smaller. However, the downside (decreased DSCV) is substantially smaller, see Figures 3.4 and 3.5.
This suggests that a policy of lowering the significance threshold would perhaps be best suited in a field of research in which increasing one’s sample size is less burdensome. This nuance recalls the suggestion of Ioannidis et al. (2013): “Instead of trying to fit all studies to traditionally acceptable type I and type II errors, it may be preferable for investigators to select type I and type II error pairs that are optimal for the truly important outcomes and for clinically or biologically meaningful effect sizes.”

4. When novelty is more of a requirement for publication (i.e., when m is larger), the benefit of lowering the significance threshold is larger and the downside smaller. This result is due to the fact that a smaller α will incentivize researchers to allocate resources in the direction towards either higher-powered or higher psp studies (i.e., away from the South-West corner of the plots in Figure 3.2). There is a choice between moving towards higher pwr (North) or towards higher psp (East), and different costs associated with each direction. This suggests that for a lower significance threshold policy to be most effective, editors should also adopt, in conjunction, stricter requirements for research novelty. To illustrate, consider three ecosystems of potential interest with their estimated REL, DSCV and PubR metrics (all with δ = 0.2):

(1) The baseline defined by α = 0.05, m = 3, and k = 500 with: REL = 0.650, DSCV = 0.105, and PubR = 0.054;
(2) the alternative defined by α = 0.005, m = 3, and k = 500 with: REL = 0.963, DSCV = 0.034, and PubR = 0.040; and
(3) the suggested defined by α = 0.005, m = 9, and k = 500 with: REL = 0.898, DSCV = 0.112, and PubR = 0.015.

Note that while the suggested has high REL and relatively high DSCV, the PubR is substantially reduced.

5. As mentioned earlier, the balance between exploratory and confirmatory research is an important aspect of a scientific ecosystem.
The results show that the interquartile range for psp does not change substantially with α. As such, we could conclude that even with a much lower significance threshold, there will still be a wide range of studies attempted in terms of their psp. However, psp values do tend to be substantially higher with smaller α. As such, we should expect that, with smaller α, research will move towards more confirmatory, and less exploratory studies.

6. With a lower α significance threshold, the published effect sizes are much more accurate. When α = 0.005, the Type M error is reduced to at most 1.4 (i.e., effect size estimates are inflated by at most 40%) and Type S error is negligible; see the top-left (“δ = 0.2” and “without SSR”) panels of Figures 3.12 and 3.13. These reductions are greatest when data are relatively expensive and when the true effect size is small (i.e., when k = 20 and δ = 0.20).

3.3 The effects of strict a priori power calculation requirements

In this section, we investigate the effects of requiring “sufficient” sample sizes. In practical terms, this means adopting publication policies that require studies to show a priori power calculations indicating that sample sizes are “sufficiently large” to achieve the desired level of statistical power, typically 80%. Whereas before, the chance of publishing a positive study in our framework depended only on psp, we now set the probability of publication, A, to also depend on power. In doing so however, we wish to acknowledge the fact that a priori sample size claims are often “wildly optimistic” (Bland, 2009) (for many reasons as discussed in Chapter 1).

To take into account the problematic nature of a priori power calculations, our model is defined such that possessing only 50% power substantially reduces, but does not eliminate, the chance of publication. For studies which really do have 80% power and higher, there will be no notable reduction in the probability of publication.
A convenient choice of function to represent a journal policy of requiring an a priori sample size justification defines A as a continuous function of both psp and pwr:

$$A = \frac{(1 - psp)^{m}}{1 + \exp\left\{ -(\log 19)\, \dfrac{pwr - c_{50}}{c_{95} - c_{50}} \right\}}. \tag{3.2}$$

So A is reduced more when power is lower, with the extent of the reduction parameterized by (c50, c95). Specifically, c95 is the value for pwr at which the multiplicative reduction is near negligible (factor of 0.95), while c50 is the value for pwr at which the multiplicative reduction is a factor of 0.5. We conduct our experimentation using (c50, c95) = (0.5, 0.8), with the following rationale. If a journal does require an a priori sample size justification, a claim of 80% power is the typical requirement. Hence a study which really attains 80% power is not likely to suffer in its quest for publication, motivating the choice c95 = 0.8.

We calculated the metrics of interest for the same 54 different ecosystems as in Section 3.2, with our new definition of A. To contrast these ecosystems with those discussed in the previous section, we refer to these ecosystems as “with SSR” (sample size requirements).

3.3.1 Results

Ratio of ecosystem with SSR to ecosystem without for:

δ     k    m  PubR  REL   DSCV  DREL  MpwrPUB  IQpspPUB
0.20  20   1  1.46  1.44  0.32  3.97  3.57     0.87
0.20  20   3  1.04  2.08  0.35  4.01  6.13     1.08
0.20  20   9  0.65  3.33  0.41  4.12  7.61     1.38
0.20  100  1  1.21  1.18  0.53  2.40  1.74     0.92
0.20  100  3  1.03  1.44  0.56  2.42  2.25     1.00
0.20  100  9  0.77  2.03  0.63  2.48  3.33     1.20
0.20  500  1  1.08  1.07  0.75  1.63  1.27     0.95
0.20  500  3  1.00  1.18  0.79  1.64  1.39     1.00
0.20  500  9  0.85  1.44  0.85  1.67  1.68     1.06
0.43  20   1  1.21  1.18  0.53  2.39  1.72     0.92
0.43  20   3  1.03  1.44  0.56  2.41  2.25     1.00
0.43  20   9  0.77  2.02  0.63  2.47  3.33     1.20
0.43  100  1  1.08  1.07  0.74  1.65  1.29     0.95
0.43  100  3  1.00  1.19  0.78  1.66  1.41     1.00
0.43  100  9  0.85  1.45  0.85  1.69  1.72     1.12
0.43  500  1  1.04  1.04  0.89  1.38  1.14     0.97
0.43  500  3  1.00  1.10  0.92  1.38  1.19     1.00
0.43  500  9  0.90  1.25  0.98  1.40  1.32     1.12

Table 3.5: A comparison between ecosystems with a sample size requirement and ecosystems without. For ecosystems with sample size requirements (SSR), A is defined by equation 3.2 (with c50 = 0.5 and c95 = 0.8), B = 0, α = 0.05, and varying values of δ, k and m. R-code to reproduce: https://goo.gl/4J9NyV

Based on the results in Table 3.5, we can make the following main conclusions on the measurable consequences of adopting a journal policy requiring an a priori sample size justification.

1. With SSR, we observed much higher powered studies, see Figure 3.10. Reliability is also increased, particularly when novelty is highly prized (m is large) and effect sizes are small. This result is as expected. As soon as having a small sample size jeopardizes the probability of publication, it is in the researcher’s best interest to conduct higher-powered studies.

2. Requiring “sufficient sample sizes” for publication can be quite detrimental in terms of the number of breakthrough discoveries. The impact on DSCV is greatest when δ, k, and m are all small; see Table 3.5. The DREL numbers show that reliability is particularly increased for low psp publications.

3. In conjunction with requiring larger sample sizes, it may be wise to place greater emphasis on research novelty. As in Section 3.2, such a combined approach could see an increase in reliability with only a limited decrease in discovery. This trade-off is most beneficial when k is small.
Consider three ecosystems of potential interest (all with α = 0.05, and δ = 0.2) with their estimated REL, DSCV and PubR metrics:

(1) The baseline defined by m = 3, k = 100, and without SSR; with REL = 0.524, DSCV = 0.216, PubR = 0.043;
(2) the alternative defined by m = 3, k = 100 and with SSR; with REL = 0.757, DSCV = 0.122, PubR = 0.044; and
(3) the suggested defined by m = 6, k = 100 and with SSR; with REL = 0.502, DSCV = 0.325, PubR = 0.022.

Note that while the suggested has both higher REL and higher DSCV than the baseline, the PubR is reduced.

4. With SSR, the Type M error is at most 1.2 (i.e., effect size estimates are inflated by at most 20%) and Type S error is negligible; see Figures 3.12 and 3.13.

3.3.2 In tandem: The effects of adopting both a lower significance threshold and a power requirement

We are also curious as to whether lowering the significance threshold in addition to requiring larger sample sizes would carry any additional benefits relative to each policy innovation on its own. See Table 3.6 for results, and consider two main findings:

1. For ecosystems with SSR, the distribution over (PSP, PWR) of published studies does not change dramatically when α is lowered. Figure 3.3 shows the density describing the characteristics of published studies, fPUB(psp, pwr), for the four policies of interest, with k = 500 and m = 3: (1) α = 0.05, without SSR; (2) α = 0.05, with SSR; (3) α = 0.005, without SSR; and (4) α = 0.005, with SSR. The difference between the densities in (2) and (4) is primarily a matter of a shift in psp.

2. As expected, lowering the significance threshold further increases reliability and the number of breakthrough discoveries is further decreased; see Table 3.6, and Figures 3.7 and 3.4.

Ratio of α = 0.005, SSR ecosystem to α = 0.05, SSR ecosystem for:

δ     k    m  PubR  REL   DSCV  DREL   MpwrPUB  IQpspPUB
0.20  20   1  1.77  1.61  0.09  11.83  3.77     0.83
0.20  20   3  1.14  2.69  0.10  12.16  6.57     1.03
0.20  20   9  0.53  6.31  0.15  13.23  8.33     1.46
0.20  100  1  1.29  1.31  0.15  6.99   1.81     0.86
0.20  100  3  0.97  1.86  0.17  7.17   2.38     0.95
0.20  100  9  0.52  3.78  0.25  7.77   3.60     1.33
0.20  500  1  1.03  1.19  0.24  4.57   1.29     0.90
0.20  500  3  0.85  1.50  0.28  4.68   1.43     0.93
0.20  500  9  0.52  2.60  0.38  5.03   1.76     1.18
0.43  20   1  1.29  1.31  0.15  6.97   1.79     0.86
0.43  20   3  0.97  1.86  0.17  7.15   2.38     0.95
0.43  20   9  0.52  3.77  0.25  7.74   3.60     1.33
0.43  100  1  1.04  1.19  0.24  4.63   1.31     0.90
0.43  100  3  0.85  1.51  0.28  4.74   1.44     0.93
0.43  100  9  0.52  2.62  0.37  5.09   1.80     1.25
0.43  500  1  0.97  1.15  0.36  3.72   1.15     0.92
0.43  500  3  0.83  1.38  0.41  3.80   1.21     0.95
0.43  500  9  0.54  2.20  0.53  4.07   1.35     1.18

Table 3.6: A comparison between ecosystems with both a sample size requirement (SSR) and α = 0.005 and ecosystems without a sample size requirement (SSR) and α = 0.05. For those with a sample size requirement, A is defined by equation 3.2 (with c50 = 0.5 and c95 = 0.8). For those without a sample size requirement, A = (1 - psp)^m. For all we have B = 0, and varying values of δ, k and m as indicated. R-code to reproduce at: http://tinyurl.com/y5y5cqs4

3.4 Conclusion

Going forward, it is important to recognize that current norms for Type 1 and Type 2 error levels have been driven (almost) entirely by tradition and inertia rather than careful coherent planning and result-driven decisions (Hubbard and Bayarri, 2003). Hence, improvements should be possible.

In contrast to Amrhein et al. (2017) who suggest that a more stringent α threshold will lead to published science being less reliable, our results suggest otherwise. However, just as Amrhein et al.
(2017) contend, our results indicate that the publication rate will end up being substantially lower with a smaller α. While going from p < 0.05 to p < 0.005 may be beneficial to published science in terms of reliability, we caution that there may be a large cost in terms of fewer true breakthrough discoveries. One must ask whether a large reduction in novel discoveries is an acceptable price to pay for a large increase in their reliability (and the reliability of all published literature). Importantly, and somewhat unexpectedly, our results suggest that this can be mitigated (to some degree) by adopting a greater emphasis on research novelty. In practice, implementing stricter requirements for research novelty could be accomplished by greater editorial emphasis on “surprising results” as a requisite for publication. This approach, however, might be difficult to achieve unless one is willing to accept a much lower publication rate. In summary, publishing less may be the necessary price to pay for obtaining more reliable science.

Recently, some have suggested that researchers choose (and justify) an “optimal” value for α, for each unique study; see Mudge et al. (2012), Ioannidis et al. (2013) and Lakens et al. (2018a). Each study within a journal would thereby have a different set of criteria. This is a most interesting idea and there are persuasive arguments in favour of such an approach. Still, it is difficult to anticipate how such a policy would play out in practice and how the research incentive structure would change in response. A model, like the one presented here, but with the α threshold set to vary with psp, could provide useful insight. Perhaps different standards for exploratory studies and confirmatory studies are warranted.

We are also cautious about greater sample size requirements. In summary, we found that the impacts of adopting a sample size requirement policy are similar to the impacts of lowering the α significance threshold.
While improving reliability, requiring studies to show “sufficient power” will severely limit novel discoveries in fields where acquiring data is expensive. Is it beneficial for editors and reviewers to consider whether a study has “sufficient power”? How much should these criteria influence publication decisions? Answers to these questions are not at all obvious. Again, using the methodology introduced, we suggest that adopting a greater emphasis on research novelty may mitigate, to a certain extent, some of the downside of adopting greater sample size requirements at the cost of lowering the overall number of published studies. Given that, as a result of publication bias, it can often be better to discard 90% of published results for meta-analytical purposes (see Stanley et al. (2010)), this may be an approach worth considering.

Our main recommendation is that, before adopting any (radical) policy changes, we should take a moment to carefully consider, and model, how these proposed changes might impact outcomes. In Chapter 4, we will adapt the model and methodology presented in this chapter to evaluate the pros and cons of the RR policy, the BF-RR policy and the novel CET policy.

Publication: Note that much of this chapter has been published as a peer-reviewed article in the journal The American Statistician, see Campbell and Gustafson (2019).

Code to reproduce all Results, Tables and Figures in this chapter: Please note that all code to produce the results, tables and figures in this chapter has been posted to the Open Science Framework, DOI 10.17605/OSF.IO/YQCVA.

3.5 Chapter 3 Appendix

3.5.1 Distributions: further details

Here we introduce some compact notation that will be useful for expressing distributional quantities of interest.
Particularly, the numbers comprising the distribution of studies across the eight categories are expressed as $q_{abc}(psp, pwr)$, where a ∈ {0,1} indicates the truth (a = 0 for null, a = 1 for alternative), b ∈ {0,1} indicates the statistical finding (b = 0 for negative, b = 1 for positive), and c ∈ {U,P} indicates publication status. As examples, we could write:

$$q_{10U} = FN_{UN} = psp \cdot n_S \cdot (1 - B) \cdot \Pr(FN), \quad \text{or}$$
$$q_{11P} = TP_{PUB} = psp \cdot n_S \cdot A \cdot \Pr(TP).$$

We also use a plus notation to add over subscripts, so, for instance, $q_{1+P} = TP_{PUB} + FN_{PUB}$.

As motivated above, we consider properties that result from a scientist or group of scientists stochastically allocating T resources (not studies per se) according to a distribution of (psp, pwr). Presuming the incentive to publish, the density of this distribution is taken proportional to ENP(psp, pwr), which we express as

$$f_{RES}(psp, pwr) \propto ENP(psp, pwr) \propto \{k + n(pwr)\}^{-1}\, q_{++P}(psp, pwr). \tag{3.3}$$

Consequently, the distribution of (psp, pwr) across attempted studies has density

$$f_{ATM}(psp, pwr) \propto \{k + n(pwr)\}^{-1} f_{RES}(psp, pwr) \propto \{k + n(pwr)\}^{-2}\, q_{++P}(psp, pwr). \tag{3.4}$$

In turn, the distribution of (psp, pwr) across published studies has density

$$f_{PUB}(psp, pwr) \propto f_{ATM}(psp, pwr)\, q_{++P}(psp, pwr) \propto \{k + n(pwr)\}^{-2} \{q_{++P}(psp, pwr)\}^{2}. \tag{3.5}$$

Note particularly that $f_{PUB}(psp, pwr) \propto \{f_{RES}(psp, pwr)\}^{2}$. Hence the distribution of (psp, pwr) across published studies is a concentrated version of the distribution describing how resources are deployed.

In order to properly define the metrics of interest, we will consider the pre-study probability (psp) and the power (pwr) as random variables, denoted PSP and PWR. These random variables jointly follow a distribution defined by one of three probability density functions: $f_{RES}()$, $f_{ATM}()$, or $f_{PUB}()$.

To better understand, consider a helpful analogy. Suppose one is interested in the distribution of heights and weights across different populations. One might consider HEIGHT and WEIGHT as random variables corresponding to different groups of people.
For example, the distribution of (HEIGHT, WEIGHT) across children and across adults will no doubt be different. For the two different populations, one might therefore define two different probability density functions: $f_{Children}()$ and $f_{Adults}()$. Finally, with knowledge of these distributions, we could consider different metrics of interest, such as the body mass index (BMI).

3.5.2 Ecosystem metrics: further details

We evaluate each ecosystem of interest on the basis of the following metrics.

The Reliability (REL)

A highly relevant metric for the scientific ecosystem is the proportion of published findings that are correct. In all the ecosystems we consider in this chapter, we make the assumption that only positive results are published (i.e., B = 0). Therefore, we can express reliability (REL) simply as:

$$REL = \frac{E_{ATM}\{q_{11P}(PSP, PWR)\}}{E_{ATM}\{q_{+1P}(PSP, PWR)\}},$$

where $E_{XYZ}()$ denotes the expected value when (PSP, PWR) are distributed according to the $f_{XYZ}()$ probability density function.

Note that we define REL as the ratio of expectations rather than as the expectation of a ratio. We can explain this choice as follows. Consider drawing m iid studies from the ATM distribution. Of these m studies, let $B_m$ be the number that are published, and $A_m$ be the number that are published and are correct. Then we have:

$$A_m \sim \text{Binomial}(m, p) \quad \text{and} \quad B_m \sim \text{Binomial}(m, q),$$

where p and q are $E_{ATM}\{q_{11P}(PSP, PWR)\}$ and $E_{ATM}\{q_{+1P}(PSP, PWR)\}$ respectively.

Then, for fixed m, we could consider the distribution of reliability, i.e., the distribution of $A_m/B_m$. However, this would be of little interest since this distribution will depend on the specific value of m.
Instead, we consider the “large m” limit. Since $A_m/m$ and $B_m/m$ converge to point masses at p and q, $A_m/B_m$ converges to a point mass at p/q, and we define our reliability metric as REL = p/q, which can be interpreted as the reliability over a “large number” of studies.

More generally, in ecosystems where negative results might be published (i.e., B(psp, pwr) ≠ 0), the reliability would equal the proportion of published papers that reach a correct conclusion, i.e.,

$$REL = \frac{E_{ATM}\{q_{00P}(PSP, PWR) + q_{11P}(PSP, PWR)\}}{E_{ATM}\{q_{++P}(PSP, PWR)\}}.$$

Publication Rate and the Number of Studies Attempted/Published

If T units of resources are deployed according to $f_{RES}()$, then we expect that $N_{ATM}$ studies will be attempted, where

$$T^{-1} N_{ATM} = E_{RES}\left[\{k + n(PWR)\}^{-1}\right].$$

Similarly, $N_{PUB}$ studies will be published, with

$$T^{-1} N_{PUB} = E_{RES}\left[\{k + n(PWR)\}^{-1}\, q_{++P}(PSP, PWR)\right].$$

The ratio $N_{PUB}/N_{ATM}$, which does not depend on T, is of evident interest, as the publication rate (PubR) for attempted studies.

Breakthrough Discoveries (DSCV)

The ability of the scientific ecosystem to produce breakthrough findings is an important attribute. We quantify this in terms of spending T resource units yielding an expectation of DSCV breakthrough results. Here a true breakthrough result is defined as a true positive and published study that results from a psp value below a threshold, i.e., a very surprising positive finding that gets published and also is true. If we set the breakthrough threshold as psp < 0.05, then

$$T^{-1} DSCV = E_{RES}\left\{ \frac{I_{(0,0.05)}(PSP)\, q_{11P}(PSP, PWR)}{k + n(PWR)} \right\}.$$

Note that we can also write this in terms of the density of attempted studies, $f_{ATM}$:

$$T^{-1} DSCV = \frac{\int h(PSP, PWR)\, f_{ATM}(PSP, PWR)\, \{k + n(PWR)\}}{\int f_{ATM}(PSP, PWR)\, \{k + n(PWR)\}} = \frac{E_{ATM}\left[I_{(0,0.05)}(PSP)\, q_{11P}(PSP, PWR)\right]}{E_{ATM}\left[k + n(PWR)\right]},$$

where:

$$h(PSP, PWR) = \frac{I_{(0,0.05)}(PSP)\, q_{11P}(PSP, PWR)}{k + n(PWR)}.$$

We also wish to quantify the reliability of those published studies with psp values below a threshold of 0.05.
We have that:

$$DREL = \frac{E_{ATM}\left[I_{(0,0.05)}(PSP)\, q_{11P}(PSP, PWR)\right]}{E_{ATM}\left[I_{(0,0.05)}(PSP)\, q_{+1P}(PSP, PWR)\right]}.$$

The Median Power of Published Studies (MpwrPUB)

We already mentioned the relevance of the psp marginals of (3.4) and (3.5). In a similar vein, the marginal distributions of pwr under each of these distributions are readily interpreted metrics of the ecosystem. We define MpwrPUB as the median of pwr under the $f_{PUB}()$ distribution.

The Balance between Exploration and Confirmation (IQpspPUB)

There has been much discussion about the desired balance between researchers looking for a priori unlikely relationships versus confirming suspected relationships put forth by other researchers; see for example, Sakaluk (2016) and Kimmelman et al. (2014). The marginal distribution of psp arising from $f_{ATM}(psp, pwr)$ describes the balance between exploration versus confirmation for attempted studies, while the psp marginal from $f_{PUB}(psp, pwr)$ does the same for published studies. More specifically, we report the interquartile ranges of these marginal distributions for a given ecosystem.

The Accuracy of Published Research

Consider extending $f_{ATM}(psp, pwr)$ to $f_{ATM}(psp, pwr, d, u)$, where:

$$f_{ATM}(d, u \mid psp, pwr) = f_{ATM}(d \mid psp, pwr)\, f_{ATM}(u),$$

and where

- $f_{ATM}(u)$ is the density of a standard uniform distribution; and
- $f_{ATM}(d \mid psp, pwr)$ is the density of a two-component mixture distribution such that: Pr(d = δ) = psp; and Pr(d = 0) = 1 - psp.

In order to obtain a sample from the distribution of published effect sizes, we proceed according to the following steps.

Set w = 1, q = 1 and b = 1. While w < 1,000:

1. Draw a Monte Carlo realization of (psp[b], pwr[b], d[b], u[b]) from $f_{ATM}(psp, pwr, d, u)$.
2. Simulate a two-sample normally distributed dataset, data[b], with a sample size of n(pwr[b]), a true difference in population means of µd = d[b], and a true variance of σ² = 1.
3. Run a standard t-test on data[b], and obtain the estimated effect size, d̂[b], and the two-sided p-value, pval[b].
4. If u[b] < A(psp[b], pwr[b]) and pval[b] < α, then do: d̂_{+1P}[q] = d̂[b]; and q = q + 1.
5. If u[b] < A(psp[b], pwr[b]) and pval[b] < α, and d[b] = δ, then do: d̂_{11P}[w] = d̂[b]; and w = w + 1.
6. Set b = b + 1.

Then we have that {d̂_{+1P}} is a Monte Carlo sample from the conditional distribution of effect sizes given that studies are both positive and published. We also have that {d̂_{11P}} is a Monte Carlo sample from the conditional distribution of effect sizes given that studies are positive, and published, and true.

Using these Monte Carlo samples, we can easily approximate the following metrics of interest:

• Mean Absolute Published Effect Size (|µ̂d|PUB). Approximated as the mean of |{d̂_{+1P}}|.
• Proportion of Published Effect Sizes that are Negative (µ̂dnegPUB). Approximated as the proportion of {d̂_{+1P}} that are less than zero.
• The “Type S” error. Approximated as the proportion of {d̂_{11P}} less than 0.
• The “Type M” error. Approximated as the mean of the absolute effect size divided by the true effect size: |{d̂_{11P}}|/δ.

Figure 3.1: The nine gridded squares represent nine different research strategies. From left to right, we have psp = 0.2, 0.5, and 0.8, and from bottom to top, we have pwr = 0.2, 0.5, and 0.8. For each pair of values for (psp, pwr), the different coloured areas represent the probabilities of a true positive (TP), a false positive (FP), a true negative (TN), and a false negative (FN). The figure corresponds to an ecosystem defined by α = 0.05, m = 3, k = 100, B = 0, µd = 0.20, σ² = 1, and T = 100,000. R-code to reproduce: https://goo.gl/6Yurda

Figure 3.2: Plots show the expected number of publications, ENP, for different values of PSP and PWR. The printed numbers correspond to values of ENP at psp = 0.2, 0.5, and 0.8; and pwr = 0.2, 0.5, and 0.8. Each panel represents one ecosystem defined by α = 0.05, A = (1 - psp)^m, and B = 0, with k and m as indicated by column and row labels respectively, and fixed δ = 0.20.
Depending on k and m, the value of (PSP, PWR) that maximizes ENP can change substantially. R-code to reproduce: https://goo.gl/a66HiK

[Figure 3.3 image: heat-map panels, columns “Without SSR” and “With SSR”, rows α = 0.005 and α = 0.05; x-axis PSP, y-axis PWR, colour scale Density.]

Figure 3.3: Heat-maps show the density of published papers, fPUB(psp, pwr), for the four publication policies of interest, with fixed δ = 0.2, k = 500 and m = 3. R-code to reproduce: https://goo.gl/knMJHt

[Figure 3.4 image: panels by δ = 0.2 and δ = 0.43 (columns) and “Without SSR” and “With SSR” (rows); x-axis α, y-axis DSCV; legend k = 20, 100, 500 and m = 1, 3, 9.]

Figure 3.4: Values of DSCV (Breakthrough Discoveries) for varying values of k, m and α; results for δ = 0.2 and δ = 0.43 on the left and right panels respectively. Top-panels show results with no power requirement (i.e., A = (1 - psp)^m). Bottom-panels show results with power requirement (i.e., A defined as per equation 3.2 (with c50 = 0.5 and c95 = 0.8)). R-code to reproduce: http://tinyurl.com/yxpqjgk7.

[Figure 3.5 image: panels by δ = 0.2 and δ = 0.43 (columns) and “Without SSR” and “With SSR” (rows); x-axis α, y-axis EFFECT_SIZE_NEG; legend k = 20, 100, 500 and m = 1, 3, 9.]

Figure 3.5: Values of µ̂dnegPUB (the proportion of published effect sizes that are negative) for varying values of k, m and α; results for δ = 0.2 and δ = 0.43 on the left and right panels respectively. Top-panel shows results with no power requirement (i.e., A = (1 - psp)^m). Bottom-panel shows results with power requirement (i.e., A defined as per equation 3.2 (with c50 = 0.5 and c95 = 0.8)).
R-code to reproduce: http://tinyurl.com/yxpqjgk7.

[Figure 3.6 image: panels by δ = 0.2 and δ = 0.43 (columns) and “Without SSR” and “With SSR” (rows); x-axis α, y-axis ABS_EFFECT_SIZE; legend k = 20, 100, 500 and m = 1, 3, 9.]

Figure 3.6: Values of |µ̂d|PUB (the average of the absolute published effect size) for varying values of k, m and α; results for δ = 0.2 and δ = 0.43 on the left and right panels respectively. Top-panel shows results with no power requirement (i.e., A = (1 - psp)^m). Bottom-panel shows results with power requirement (i.e., A defined as per equation 3.2 (with c50 = 0.5 and c95 = 0.8)). R-code to reproduce: http://tinyurl.com/yxpqjgk7.

[Figure 3.7 image: panels by δ = 0.2 and δ = 0.43 (columns) and “Without SSR” and “With SSR” (rows); x-axis α, y-axis REL; legend k = 20, 100, 500 and m = 1, 3, 9.]

Figure 3.7: Values of REL (the reliability) for varying values of k, m and α; results for δ = 0.2 and δ = 0.43 on the left and right panels respectively. Top-panels show results with no power requirement (i.e., A = (1 - psp)^m). Bottom-panels show results with power requirement (i.e., A defined as per equation 3.2 (with c50 = 0.5 and c95 = 0.8)). R-code to reproduce: http://tinyurl.com/yxpqjgk7.

[Figure 3.8 image: panels by δ = 0.2 and δ = 0.43 (columns) and “Without SSR” and “With SSR” (rows); x-axis α, y-axis PR; legend k = 20, 100, 500 and m = 1, 3, 9.]

Figure 3.8: Values of PubR (the publication rate) for varying values of k, m and α; results for δ = 0.2 and δ = 0.43 on the left and right panels respectively. Top-panels show results with no power requirement (i.e., A = (1 - psp)^m). Bottom-panels show results with power requirement (i.e., A defined as per equation 3.2 (with c50 = 0.5 and c95 = 0.8)).
R-code to reproduce: http://tinyurl.com/yxpqjgk7.

[Figure 3.9 image: panels by δ = 0.2 and δ = 0.43 (columns) and “Without SSR” and “With SSR” (rows); x-axis α, y-axis DREL; legend k = 20, 100, 500 and m = 1, 3, 9.]

Figure 3.9: Values of DREL (the reliability of breakthrough discoveries) for varying values of k, m and α; results for δ = 0.2 and δ = 0.43 on the left and right panels respectively. Top-panels show results with no power requirement (i.e., A = (1 - psp)^m). Bottom-panels show results with power requirement (i.e., A defined as per equation 3.2 (with c50 = 0.5 and c95 = 0.8)). R-code to reproduce: http://tinyurl.com/yxpqjgk7.

[Figure 3.10 image: panels by δ = 0.2 and δ = 0.43 (columns) and “Without SSR” and “With SSR” (rows); x-axis α, y-axis PWR.PUB.MED; legend k = 20, 100, 500 and m = 1, 3, 9.]

Figure 3.10: Values of MpwrPUB (the median power of published studies) for varying values of k, m and α; results for δ = 0.2 and δ = 0.43 on the left and right panels respectively. Top-panels show results with no power requirement (i.e., A = (1 - psp)^m). Bottom-panels show results with power requirement (i.e., A defined as per equation 3.2 (with c50 = 0.5 and c95 = 0.8)). Note that the fixed study cost, k, as a proportion of total cost is much smaller for δ = 0.43 vs. for δ = 0.2. As such it is much “less expensive” to attain high power. R-code to reproduce: http://tinyurl.com/yxpqjgk7.

[Figure 3.11 image: panels by δ = 0.2 and δ = 0.43 (columns) and “Without SSR” and “With SSR” (rows); x-axis α, y-axis PSP.PUB.RANGE; legend k = 20, 100, 500 and m = 1, 3, 9.]

Figure 3.11: Values of IQpspPUB (the interquartile range of the distribution of psp values amongst published studies) for varying values of k, m and α; results for δ = 0.2 and δ = 0.43 on the left and right panels respectively. Top-panels show results with no power requirement (i.e., A = (1 - psp)^m).
Bottom-panels show results with power requirement (i.e., A defined as per equation 3.2 (with c50 = 0.5 and c95 = 0.8)). R-code to reproduce: http://tinyurl.com/yxpqjgk7.

[Figure 3.12 image: panels by δ = 0.2 and δ = 0.43 (columns) and “Without SSR” and “With SSR” (rows); x-axis α, y-axis TYPE_M_ERROR; legend k = 20, 100, 500 and m = 1, 3, 9.]

Figure 3.12: Values of the approximated exaggeration ratio (the expected Type M error) for varying values of k, m and α; results for δ = 0.2 and δ = 0.43 on the left and right panels respectively. Top-panels show results with no power requirement (i.e., A = (1 - psp)^m). Bottom-panels show results with power requirement (i.e., A defined as per equation 3.2 (with c50 = 0.5 and c95 = 0.8)). R-code to reproduce: http://tinyurl.com/yxpqjgk7.

[Figure 3.13 image: panels by δ = 0.2 and δ = 0.43 (columns) and “Without SSR” and “With SSR” (rows); x-axis α, y-axis TYPE_S_ERROR; legend k = 20, 100, 500 and m = 1, 3, 9.]

Figure 3.13: Values of the Type S error rate (the proportion of published studies with an error in sign) for varying values of k, m and α; results for δ = 0.2 and δ = 0.43 on the left and right panels respectively. Top-panels show results with no power requirement (i.e., A = (1 - psp)^m). Bottom-panels show results with power requirement (i.e., A defined as per equation 3.2 (with c50 = 0.5 and c95 = 0.8)). R-code to reproduce: http://tinyurl.com/yxpqjgk7.

Chapter 4

To make more published research true, should scientists adopt alternatives to null hypothesis significance testing?

In Chapter 3, our model made a significant assumption about the scientific publication process.
We assumed that “null results” (p-value ≥ α) were never published. In this chapter we will relax this assumption in order to examine the merits of three alternative publication policies: (1) the Registered Reports (RR) policy, (2) the Bayesian Registered Reports (or “Bayes Factor Registered Reports,” BF-RR) policy, and (3) the Conditional Equivalence Testing (CET) policy, all three policies as described in Chapter 2. These three policies explicitly identify mechanisms for the publication of null results. The goal of this chapter is to extend the model framework from Chapter 3 so as to evaluate the operating characteristics of these three alternative publication policies.

In order to extend the model framework, we must distinguish between two distinct types of null results: negative results and inconclusive results. The negative results are those for which, based on certain criteria, the data suggest the absence of any effect. The inconclusive results are those for which neither the presence of an effect nor the absence of an effect is suggested by the data.

The probability of a negative result and the probability of an inconclusive result will depend on the particular tool (e.g., CET p-value, or BF) and the specific evidence threshold being used. For example, under the frequentist CET testing scheme (as described in Section 2.1), a negative result occurs when both the p1 value is above the α1 threshold and the p2 value is less than the α2 threshold. An inconclusive result occurs when both the p1 value is above the α1 threshold and the p2 value is above the α2 threshold. In the Bayesian testing scheme described in Section 2.3 with a threshold of 3, a negative result occurs when the BF01 is greater than 3, and an inconclusive result occurs when 1/3 ≤ BF01 < 3.
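To make the distinction concrete, the two classification rules just described can be written as a short sketch. This is an illustration only and is not taken from the R code accompanying this thesis: the function names `classify_cet` and `classify_bf` are hypothetical, the CET thresholds are supplied by the caller, and we assume that a result counts as positive when the first p-value falls strictly below its threshold (or, in the Bayesian scheme, when BF01 falls below the reciprocal of the threshold). The handling of exact ties at a threshold is likewise our assumption.

```python
def classify_cet(p1, p2, alpha1, alpha2):
    """Classify a study under conditional equivalence testing (CET).

    p1 is the p-value of the standard null-hypothesis test and p2 the
    p-value of the equivalence test; alpha1 and alpha2 are their thresholds.
    Tie handling (p equal to a threshold) is an assumption of this sketch.
    """
    if p1 < alpha1:
        return "positive"       # the null hypothesis is rejected
    if p2 < alpha2:
        return "negative"       # p1 above alpha1 and p2 below alpha2
    return "inconclusive"       # p1 above alpha1 and p2 above alpha2


def classify_bf(bf01, threshold=3.0):
    """Classify a study from the Bayes factor BF01 (null over alternative)."""
    if bf01 > threshold:
        return "negative"       # evidence favours the null
    if bf01 < 1.0 / threshold:
        return "positive"       # evidence favours the alternative (assumed rule)
    return "inconclusive"       # BF01 between 1/threshold and threshold


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise each branch.
    print(classify_cet(p1=0.30, p2=0.01, alpha1=0.05, alpha2=0.05))
    print(classify_bf(bf01=1.2))
```

Under either scheme every study falls into exactly one of the three outcomes, which is what allows the publication probabilities for positive, negative and inconclusive results to be specified separately.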
In order to incorporate these two different types of “null results” into the model framework as outlined in Section 3.1.3, three changes to the model will be made.

First, we must expand the original eight categories listed in Table 3.1 to twelve different categories of findings; see Table 4.1. In Chapter 3, we defined A and B as the probability that a positive result is published, and the probability that a null result is published, respectively. In this extension, we will re-define B as the probability that a negative result is published and introduce C, defined as the probability that an inconclusive result is published.

Second, we must update the compact notation (see Section 3.5.1). Let us define q_abc(psp, pwr), so that, as before, a ∈ {0, 1} indicates the truth (a = 0 for null, a = 1 for alternative), and as before, c ∈ {U, P} indicates publication status. Going forward, let b ∈ {N, I, P} indicate the statistical finding (b = N for negative, b = I for inconclusive, b = P for positive). (The metrics defined in Section 3.5.2 for which the b subscript equals “1” should be updated so that b = P.)

Finally, in Section 3.1.4 we defined a number of different “Ecosystem Metrics” used to describe the overall characteristics of interest of a given publication ecosystem. Going forward, we will restrict our interest to a subset of those metrics defined in Chapter 3 and we now also define and investigate a few additional metrics: the average sample size of published studies (AvgSS_PUB), the Journal Content Ratio metric (JCR), and also different reliability metrics.

Expected Number of...            Equation
True positives published         TP_PUB = psp · nS · A · Pr(TP)
True positives unpublished       TP_UN  = psp · nS · (1 − A) · Pr(TP)
False negatives published        FN_PUB = psp · nS · B · Pr(FN)
False negatives unpublished      FN_UN  = psp · nS · (1 − B) · Pr(FN)
Inconclusive (pos) published     IP_PUB = psp · nS · C · Pr(INP)
Inconcl. (pos) unpublished       IP_UN  = psp · nS · (1 − C) · Pr(INP)
False positives published        FP_PUB = (1 − psp) · nS · A · Pr(FP)
False positives unpublished      FP_UN  = (1 − psp) · nS · (1 − A) · Pr(FP)
True negatives published         TN_PUB = (1 − psp) · nS · B · Pr(TN)
True negatives unpublished       TN_UN  = (1 − psp) · nS · (1 − B) · Pr(TN)
Inconclusive (neg) published     IN_PUB = (1 − psp) · nS · C · Pr(INN)
Inconcl. (neg) unpublished       IN_UN  = (1 − psp) · nS · (1 − C) · Pr(INN)

Table 4.1: Equations for the expected number of studies (out of a total of nS studies) for each of the twelve categories with A = probability of publication for a positive result; B = probability of publication for a negative result; and C = probability of publication for an inconclusive result.

4.0.1 Average sample size of published studies:

The average sample size of published studies is a useful metric, particularly when comparing publication policies based on BFs with those based on p-values. We calculate this metric as:

$$AvgSS_{PUB} = E_{PUB}\{n(PWR)\} \tag{4.1}$$

Journal Content Ratio:

The proportion of articles published that are positive findings can be summarized by the Journal Content Ratio (JCR) which we define as follows:

$$JCR = \frac{TP_{PUB} + FP_{PUB}}{TP_{PUB} + FP_{PUB} + TN_{PUB} + FN_{PUB} + IP_{PUB} + IN_{PUB}} \tag{4.2}$$

$$= \frac{q_{+PP}}{q_{++P}} \tag{4.3}$$

A publication policy with a low JCR will have a relatively high number of “null publications,” whereas a journal with a high JCR will consist primarily of positive studies, i.e., will include primarily “significant results.”

In practice, we believe journals are under pressure to publish more positive studies than negative and inconclusive studies in order to increase or maintain their impact factor rating (Lawton, 2016, Pusztai et al., 2013). As such, we can think of a higher JCR as being more desirable from a journal’s perspective.

It is conceivable that a policy change that increases the reliability of publications at the cost of substantially lowering the JCR would not receive much support from journals and would therefore be difficult to implement.
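As a concrete illustration of Table 4.1 and equation 4.2, the twelve expected counts and the resulting JCR can be computed as in the sketch below. The function names, the dictionary layout, and all numerical inputs are hypothetical choices made for illustration; in particular, the probabilities of each finding are simply supplied by hand here rather than derived from a study's power.

```python
FINDINGS = ("positive", "negative", "inconclusive")


def expected_counts(psp, n_studies, A, B, C, pr_alt, pr_null):
    """Expected number of studies in each of the twelve Table 4.1 categories.

    pr_alt / pr_null map each finding to its probability when the effect is
    real / when the null is true (each mapping should sum to 1).
    Keys of the result are (truth, finding, publication status).
    """
    pub_prob = {"positive": A, "negative": B, "inconclusive": C}
    counts = {}
    for truth, weight, probs in (("alt", psp, pr_alt),
                                 ("null", 1.0 - psp, pr_null)):
        for finding in FINDINGS:
            n = weight * n_studies * probs[finding]
            counts[(truth, finding, "published")] = n * pub_prob[finding]
            counts[(truth, finding, "unpublished")] = n * (1.0 - pub_prob[finding])
    return counts


def journal_content_ratio(counts):
    """Equation 4.2: share of published studies that are positive.

    Raises ZeroDivisionError if no studies are expected to be published.
    """
    published = {key: n for key, n in counts.items() if key[2] == "published"}
    positive = sum(n for key, n in published.items() if key[1] == "positive")
    return positive / sum(published.values())


# Illustrative inputs only: an RR-like setting with A = C and no negatives.
counts = expected_counts(
    psp=0.2, n_studies=1000, A=0.5, B=0.0, C=0.5,
    pr_alt={"positive": 0.8, "negative": 0.0, "inconclusive": 0.2},
    pr_null={"positive": 0.05, "negative": 0.0, "inconclusive": 0.95},
)
print(round(journal_content_ratio(counts), 3))
```

With these made-up inputs, 100 of the 500 expected publications are positive, so the JCR is about 0.2.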
The relationship between the impact factor and the number of published negative studies has been discussed previously; see Lawton (2016), Littner et al. (2005), and, e.g., Pusztai et al. (2013): “Journals are themselves under pressure to publish positive discoveries that will be more highly cited and will increase impact factor rating.”

Reliability:

We further refine the reliability metric, as defined in Section 3.1.4, so as to distinguish between overall reliability, OR, and reliability amongst the subset of published studies that are positive, PREL, i.e.:

$$OR = \frac{E_{ATM}\{q_{0NP}(PSP, PWR) + q_{0IP}(PSP, PWR) + q_{1PP}(PSP, PWR)\}}{E_{ATM}\{q_{++P}(PSP, PWR)\}}, \tag{4.4}$$

$$PREL = \frac{E_{ATM}\{q_{1PP}(PSP, PWR)\}}{E_{ATM}\{q_{+PP}(PSP, PWR)\}}. \tag{4.5}$$

Note that we choose to include q_0IP(PSP, PWR) in the numerator of the OR equation, implying that an inconclusive finding for a study for which the null is true is desirable from a reliability perspective. While not ideal, in situations when the null is indeed true, an inconclusive result is definitely preferable to a positive result.

We also wish to specify that the DREL metric (previously defined in Section 3.5.2) quantifies the reliability of those published positive studies with psp values below a threshold of 0.05. The notation for DREL is therefore updated as:

$$DREL = \frac{E_{ATM}\{I_{(0,0.05)}(PSP)\, q_{1PP}(PSP, PWR)\}}{E_{ATM}\{I_{(0,0.05)}(PSP)\, q_{+PP}(PSP, PWR)\}}.$$

The remainder of this chapter is structured as follows. Sections 4.1, 4.2, and 4.3 will each consider one of the three alternative publication policies of interest: (1) the RR policy, (2) the BF-RR policy, and (3) the CET policy. For each of these, we will compute the metrics of interest for 18 ecosystems defined by one of three possible values for m (= 1, 3, 9), one of three possible values for k (= 20, 100, 500), and one of two possible values for δ (= 0.20, 0.43).

Let us define some shorthand notation for three publication policies we considered in Chapter 3:

1. SP0 - the standard policy with α = 0.05 and no SSR (the characteristics of which were previously summarized in Table 3.2);
2. SP1 - the policy with α = 0.005 and no SSR (the characteristics of which were previously summarized in Table 3.4); and
3. SP2 - the policy with α = 0.05 and an 80% power SSR (c50 = 0.5, c95 = 0.8) (the characteristics of which were previously summarized in Table 3.5).

We will consider the consequences of adopting the RR policy (Section 4.1), and the BF-RR policy (Section 4.2), by examining how the metrics of interest change relative to the SP0 policy. We will also consider how both these alternative policies compare, in terms of the metrics of interest, to the policies of “greater statistical stringency,” SP1 and SP2. (We will not consider the α = 0.005 and SSR policy (i.e., the policy discussed in Section 3.3.2) for brevity.) In Section 4.3, we will investigate the operating characteristics of the novel CET policy and compare it, in terms of the metrics of interest, with the RR policy.

4.1 A model for Registered Reports

The RR policy, in which pre-registration is directly embedded within the publication pipeline (see Section 2.4), is currently being adopted with much enthusiasm at a growing number of journals (Frankenhuis and Nettle, 2018). It is worth noting, however, that the RR policy, while a “comparatively young publication format” (Nature Human Behaviour Editorial, 2018), is not necessarily a new idea.

For example, Walster and Cleary (1970) considered a very similar “proposal for a new editorial policy in the social sciences.” A similar publishing model was also initiated in the mid-1970s by parapsychologist Martin Johnson in The European Journal of Parapsychology (EJP); see Wiseman et al. (2019) for details.
Also, between 1997 and 2015, The Lancet maintained a publication policy similar to RR for clinical trials (McNamee et al., 2008).

Despite this history, as of yet, there have been no large-scale efforts to model the impact of adopting a RR-type policy on the type of scientific studies that are pursued and ultimately published. The model developed in Chapter 3 can be used to do just this.

Within our model framework, a RR policy will consider a study with a p-value < α as positive, while a study with a p-value ≥ α is inconclusive. Furthermore, a RR policy will be characterized by setting A = C. The value of the B parameter is irrelevant for RR policies since no studies are considered negative. The equality A = C represents the “result-blind” aspect of the RR policy: the probability of publishing a study does not depend on the result.

Since the RR policy requires researchers to show that their target sample size is sufficient to ensure 90% power (in some cases 80% power) in order to obtain in-principle acceptance, it is reasonable to assume that the pwr directly impacts (to a large extent) the probability of publication. We will incorporate the power calculation requirement aspect of the RR policy as was done previously in Section 3.3. The psp will also impact the probability of publication, as in the previous versions of the model. The “result-blind” probability of publication will therefore depend on parameters m, c50 and c95, as in equation 3.2, so that we have:

$$A = C = \frac{(1 - psp)^m}{1 + \exp\left\{-(\log 19)\,\dfrac{pwr - c_{50}}{c_{95} - c_{50}}\right\}}, \tag{4.6}$$

where we will set c50 = 0.5 and c95 = 0.8 (as in Chapter 3) for all scenarios going forward.

The remainder of this section is structured as follows. In Section 4.1.1, we consider how our metrics of interest change with the RR policy compared to the SP0 policy.
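For readers who wish to experiment with the result-blind publication probability in equation 4.6, a minimal Python sketch follows. It assumes the logistic term is calibrated so that it equals 0.5 when pwr = c50 and 0.95 when pwr = c95, which is what the subscripts suggest; the function name is hypothetical.

```python
import math


def publication_probability_rr(psp, pwr, m, c50=0.5, c95=0.8):
    """Sketch of equation 4.6: A = C under a RR policy.

    Assumes a novelty term (1 - psp)^m multiplied by a logistic power term
    equal to 0.5 at pwr = c50 and 0.95 at pwr = c95.
    """
    novelty = (1.0 - psp) ** m
    slope = math.log(19.0) / (c95 - c50)
    power_term = 1.0 / (1.0 + math.exp(-slope * (pwr - c50)))
    return novelty * power_term


# Illustrative values only: a long-shot hypothesis tested with 80% power.
print(publication_probability_rr(psp=0.1, pwr=0.8, m=3))
```

Under this form, lowering psp or raising pwr both increase the probability of publication, and the emphasis on novelty grows with m.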
In Section 4.1.2, we consider the merits of adopting a standard RR policy relative to the “more stringent” policies SP1 and SP2.

4.1.1 The RR policy relative to the SP0 policy

Figure 4.1 plots the expected number of publications (ENP) for different psp and pwr combinations under a RR policy. Note that the optimal (= highest ENP) strategies are of small psp and mid-to-large pwr. This represents a substantial difference (and potential improvement) over the SP0 policy. Indeed, under the SP0 policy, the optimal strategy is one of low psp and low pwr; compare Figure 4.1 with Figure 3.2.

[Figure 4.1 plot. Columns: k = 20, 100, 500; rows: m = 1, 3, 9; x-axis: PSP (0.2, 0.5, 0.8); y-axis: PWR (0.2, 0.5, 0.8); colour scale: ENP. The ENP values printed inside the panels are not recoverable from the extracted text.]

Figure 4.1: Plots show the expected number of publications, ENP, for different values of psp and pwr under a RR policy. The printed numbers correspond to values of ENP at psp = 0.2, 0.5, and 0.8; and pwr = 0.2, 0.5, and 0.8. Each panel represents one ecosystem defined by α = 0.05, A = C (according to equation 4.6), with k and m as indicated by column and row labels respectively, and δ = 0.20. Depending on k and m, the value of (psp, pwr) that maximizes ENP can change substantially. R-code to reproduce: http://tinyurl.com/yyltzbje.

Table 4.2 lists the metrics of interest for ecosystems defined by a RR policy and varying values of k, m, and δ. Table 4.3 compares the metrics of interest computed for the RR policy to those computed for the SP0 policy. This comparison allows us to draw the following conclusions on the impact of adopting a RR policy relative to the SP0 policy (standard policy with α = 0.05, and no SSR).

• The observed journal content ratio (JCR) suggests that, under a RR policy, the majority of published studies will be null/inconclusive results. What’s more, when m increases, JCR values decrease. In those scenarios considered with m = 9, fewer than 10% of published studies are positive studies. This brings to mind the “dumping ground” (Greve et al., 2013) concerns previously discussed in Chapter 1. Indeed, it is a common view among editors that the “problem with preregistration is that it makes results boring” (Chambers, 2018b).

Because RR is still in an early stage of adoption, there are few empirical estimates of the JCR amongst RR publications. Two studies provide relevant estimates. Kaplan and Irvin (2015) estimate that, amongst publications of large clinical trials (large = with direct costs above $500,000/year) that required pre-registration with clinicaltrials.gov, the JCR was approximately 8%. Allen and Mehler (2018) calculate that among 127 published biomedical and psychological science studies published under a RR policy, only 39% reported positive findings.

• A RR policy exhibits a substantially higher publication rate (roughly four times as high or more). This should come as no surprise since a RR policy publishes both positive and inconclusive findings. A RR policy also publishes larger studies (see Table 4.3 “Change in AvgSS_PUB”). This is due to the fact that, under a RR policy, researchers are compelled to reach their pre-registered target sample sizes.

• Overall reliability (OR) is substantially increased with a RR policy. However, positive reliability (PREL) is not necessarily increased. A substantial increase in PREL is associated with ecosystems where data are relatively expensive (k is small) and where there is a large emphasis on novelty (m is large). The increase in OR can be explained by the fact that under a RR policy, a large proportion of published studies will be inconclusive, and these tend to be more “reliable” than positive publications.
Breakthrough discoveries are substantially more reliable (DREL) with a RR policy for all ecosystems considered.

• In scenarios in which k is large (i.e., collecting data is relatively inexpensive), and effect sizes are large, a RR policy will lead to a potentially substantial increase in the number of breakthrough discoveries (see Table 4.3, Change in DSCV). However, when both k and δ are small (i.e., collecting data is relatively expensive and effect sizes are small), adopting a RR policy may lead to a decrease in the number of breakthrough discoveries, particularly if m is large. Indeed, when δ = 0.2, k = 20 and m = 9, the number of breakthrough discoveries is actually 41% lower than with the standard policy. This suggests that in fields where the pre-study odds are often low, collecting data is expensive, and effect sizes are typically small (e.g., in exploratory research fields such as neuroscience (Szucs and Ioannidis, 2017b)), adopting a RR policy may lead to fewer (albeit more reliable) breakthrough discoveries.

δ     k    m  PubR  OR     PREL  DSCV  DREL  JCR   AvgSS_PUB
0.20  20   1  0.25  0.870  0.81  0.17  0.26  0.20  332
0.20  20   3  0.21  0.910  0.65  0.30  0.25  0.12  332
0.20  20   9  0.19  0.933  0.41  0.52  0.24  0.08  332
0.20  100  1  0.35  0.876  0.82  0.16  0.27  0.20  449
0.20  100  3  0.30  0.913  0.66  0.28  0.26  0.13  449
0.20  100  9  0.27  0.935  0.42  0.48  0.24  0.08  449
0.20  500  1  0.44  0.887  0.82  0.11  0.28  0.21  568
0.20  500  3  0.38  0.918  0.67  0.19  0.27  0.13  568
0.20  500  9  0.34  0.937  0.44  0.33  0.25  0.08  568
0.43  20   1  0.35  0.876  0.82  0.74  0.27  0.20  99
0.43  20   3  0.30  0.913  0.66  1.30  0.26  0.13  99
0.43  20   9  0.27  0.935  0.42  2.20  0.24  0.08  99
0.43  100  1  0.44  0.886  0.82  0.53  0.28  0.21  123
0.43  100  3  0.37  0.918  0.67  0.92  0.27  0.13  123
0.43  100  9  0.34  0.937  0.44  1.56  0.25  0.08  123
0.43  500  1  0.50  0.897  0.83  0.22  0.29  0.22  147
0.43  500  3  0.43  0.923  0.68  0.39  0.29  0.14  147
0.43  500  9  0.39  0.939  0.45  0.66  0.26  0.09  147

Table 4.2: For ecosystems defined by a RR policy and varying values of k, m, and δ, the table lists estimates for the metrics of interest. These include the publication rate (PubR), reliability metrics (OR, PREL, DREL), the number of true breakthrough discoveries (DSCV), the average total sample size of published studies (AvgSS_PUB), and the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/yyduldap.

Ratio of ecosystem with RR to ecosystem with SP0 for:
δ     k    m  PubR   OR    PREL  DSCV  DREL  AvgSS_PUB  JCR
0.20  20   1   4.47  1.41  1.31  1.05  3.72  2.48       0.20
0.20  20   3   6.14  2.51  1.79  0.80  3.74  3.64       0.12
0.20  20   9   7.20  6.31  2.78  0.60  3.82  5.73       0.08
0.20  100  1   4.34  1.16  1.08  1.76  2.25  1.59       0.20
0.20  100  3   6.93  1.74  1.25  1.30  2.27  1.97       0.13
0.20  100  9   9.34  3.78  1.70  0.93  2.30  2.73       0.08
0.20  500  1   4.02  1.06  0.99  2.60  1.53  1.23       0.21
0.20  500  3   7.01  1.41  1.03  1.87  1.54  1.37       0.13
0.20  500  9  10.75  2.60  1.21  1.27  1.56  1.68       0.08
0.43  20   1   4.34  1.16  1.08  1.76  2.24  1.58       0.20
0.43  20   3   6.93  1.74  1.25  1.30  2.26  1.94       0.13
0.43  20   9   9.35  3.77  1.69  0.93  2.30  2.66       0.08
0.43  100  1   4.03  1.07  0.99  2.57  1.55  1.24       0.21
0.43  100  3   7.01  1.42  1.04  1.85  1.56  1.38       0.13
0.43  100  9  10.71  2.63  1.22  1.26  1.57  1.69       0.08
0.43  500  1   3.84  1.04  0.96  3.15  1.29  1.10       0.22
0.43  500  3   6.93  1.30  0.97  2.23  1.30  1.18       0.14
0.43  500  9  11.38  2.20  1.05  1.48  1.30  1.37       0.09

Table 4.3: A comparison between ecosystems with a RR policy and ecosystems with the SP0 policy (those listed in Table 3.2). The table lists estimates for the ratios of the publication rate (PubR), the ratio of the reliability metrics (OR, PREL, DREL), the ratio of the number of true breakthrough discoveries (DSCV), the ratio of the average sample size of published studies (AvgSS_PUB), and the ratio of the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/yy26ujqt.

4.1.2 The RR policy relative to SP1 and SP2 - The effects of adopting a RR policy relative to adopting policies of greater statistical stringency

For the same range of values for m (= 1, 3, 9), k (= 20, 100, 500), and δ (= 0.20, 0.43), we compared the standard RR policy to two policies of “greater statistical stringency,” SP1 and SP2. See Tables 4.4 and 4.5. Based on our results, we can draw the following conclusions on the impact of adopting a RR policy relative to these policies of “greater statistical stringency.”

1. For metrics OR and AvgSS_PUB, the impact of adopting a RR policy appears to be similar to the impact of adopting the SP1 or the SP2 policy.

2. The major differences between a RR policy and the “statistically stringent” policies can be seen in terms of DSCV, DREL, PubR, and JCR. The RR policy will produce substantially more novel discoveries than the SP1 and SP2 policies. However, these “discoveries” will be less reliable (this may be because these “low psp studies” are of lower pwr, but also of much lower psp). While there may be many more discoveries, it must be noted that there will also be many more publications overall. As expected, the publication rate for RR is much higher than for SP1 and SP2. As discussed earlier, most publications under a RR policy will be null/inconclusive findings, which is why we see such low JCR values.

3. For different values of k there do not appear to be any substantial changes in how RR compares to SP1 and SP2. As such, we cannot say that RR is particularly better or particularly worse than the “statistically stringent” policies for specific situations when data are affordable or expensive.

4. The positive reliability is slightly lower with the RR policy compared to the SP2 policy, and substantially lower compared to the SP1 policy, particularly for larger m. For scenarios where m = 9, positive publications under the RR policy are approximately 83% as reliable as positive publications under the SP2 policy and about 50% as reliable as positive publications under the SP1 policy. This is due to the fact that while the typical power of studies is similar for RR, SP1 and SP2 policies, the typical psp of studies under the RR policy is much lower. Researchers are more incentivized to test highly unlikely hypotheses since publication will be granted even if these studies “fail” (i.e., even if an inconclusive result is obtained).

In summary, results from our model suggest that the RR policy is advantageous relative to policies of simply lowering the significance threshold or implementing sample size requirements, in terms of a higher rate of breakthrough discoveries. This can be easily explained. Since, under the RR policy, the probability of publication is higher with a lower psp, and does not depend on a study’s result, it is clearly more advantageous to pursue low psp studies. In turn, lower psp studies lead to a larger number of discoveries. Our model also suggests two main disadvantages with the RR policy.

First, the RR policy may potentially result in a lower reliability of published positive studies. Second, under the RR policy, the JCR will be low. Whereas under SP1 and SP2, all publications are positive (i.e., JCR = 1), under a RR policy the journal content ratio (JCR) may be very low.
In a recent discussion article, Grand et al. (2018) discuss concerns that RR approaches will result in more null results that do not contribute to understanding. The authors write: “We should not simply conjecture that the literature will become rife with null results without implementing these approaches and evaluating the fruits of such efforts.” Perhaps our modeling exercise allows us to make an “educated conjecture,” that the vast majority of published studies under a RR policy will be inconclusive or negative.

Ratio of ecosystem with RR to ecosystem with SP1 for:
δ     k    m  PubR   OR    PREL  DSCV  DREL  AvgSS_PUB  JCR
0.20  20   1   3.93  0.89  0.83  8.70  0.49  0.64       0.20
0.20  20   3   8.66  0.97  0.69  5.50  0.49  0.66       0.12
0.20  20   9  22.54  1.17  0.52  2.90  0.48  0.74       0.08
0.20  100  1   4.41  0.89  0.83  9.16  0.41  0.63       0.20
0.20  100  3   9.48  0.96  0.69  5.77  0.41  0.65       0.13
0.20  100  9  24.12  1.09  0.49  3.02  0.39  0.69       0.08
0.20  500  1   4.40  0.90  0.84  9.12  0.38  0.63       0.21
0.20  500  3   9.37  0.95  0.70  5.72  0.37  0.64       0.13
0.20  500  9  23.53  1.04  0.49  2.96  0.35  0.66       0.08
0.43  20   1   4.37  0.89  0.83  9.25  0.41  0.63       0.20
0.43  20   3   9.39  0.96  0.69  5.82  0.40  0.64       0.13
0.43  20   9  23.87  1.09  0.49  3.05  0.39  0.68       0.08
0.43  100  1   4.40  0.90  0.84  9.19  0.38  0.63       0.21
0.43  100  3   9.36  0.95  0.70  5.76  0.37  0.63       0.13
0.43  100  9  23.53  1.04  0.49  2.98  0.35  0.65       0.08
0.43  500  1   4.19  0.91  0.84  8.41  0.37  0.62       0.22
0.43  500  3   8.90  0.95  0.70  5.26  0.36  0.63       0.14
0.43  500  9  22.35  1.02  0.49  2.70  0.35  0.64       0.09

Table 4.4: A comparison between ecosystems with a RR policy and ecosystems with the SP1 policy. The table lists estimates for the ratios of the publication rate (PubR), the ratio of the reliability metrics (OR, PREL, DREL), the ratio of the number of true breakthrough discoveries (DSCV), the ratio of the average sample size of published studies (AvgSS_PUB), and the ratio of the journal content ratio (JCR).
R-code to reproduce: http://tinyurl.com/yxfcsyz7.

4.1.3 Concluding thoughts on RR

While it is true that RR policies are gaining traction in many fields, in every journal that has adopted RR thus far, the RR publication route is available only as an option for researchers submitting their work. Traditional publication avenues still remain available for submitting a manuscript to all of these journals. We have not considered such a hybrid policy. When observing how RR currently impacts researcher incentives, we must acknowledge that, as things are currently implemented, researchers may “use [RR] strategically for hypotheses that they anticipate will not work out, given that registered reports more or less guarantee publication” (Warren, 2018).

Ratio of ecosystem with RR to ecosystem with SP2 for:
δ     k    m  PubR   OR    PREL  DSCV  DREL  AvgSS_PUB  JCR
0.20  20   1   3.07  0.98  0.92  3.26  0.94  0.83       0.20
0.20  20   3   5.92  1.21  0.86  2.25  0.93  0.85       0.12
0.20  20   9  11.16  1.89  0.83  1.44  0.93  0.90       0.08
0.20  100  1   3.58  0.99  0.92  3.35  0.94  0.85       0.20
0.20  100  3   6.73  1.20  0.87  2.31  0.94  0.87       0.13
0.20  100  9  12.20  1.86  0.84  1.46  0.93  0.91       0.08
0.20  500  1   3.74  0.99  0.92  3.47  0.94  0.87       0.21
0.20  500  3   7.00  1.19  0.87  2.37  0.94  0.88       0.13
0.20  500  9  12.59  1.81  0.84  1.49  0.93  0.92       0.08
0.43  20   1   3.58  0.99  0.92  3.35  0.94  0.85       0.20
0.43  20   3   6.74  1.20  0.87  2.31  0.94  0.87       0.13
0.43  20   9  12.20  1.86  0.84  1.46  0.93  0.91       0.08
0.43  100  1   3.74  0.99  0.92  3.46  0.94  0.87       0.21
0.43  100  3   7.00  1.19  0.87  2.37  0.94  0.88       0.13
0.43  100  9  12.59  1.81  0.84  1.49  0.93  0.92       0.08
0.43  500  1   3.69  1.00  0.93  3.55  0.94  0.88       0.22
0.43  500  3   6.94  1.18  0.88  2.42  0.94  0.89       0.14
0.43  500  9  12.57  1.75  0.84  1.51  0.93  0.92       0.09

Table 4.5: A comparison between ecosystems with a RR policy and ecosystems with the SP2 policy. The table lists estimates for the ratios of the publication rate (PubR), the ratio of the reliability metrics (OR, PREL, DREL), the ratio of the number of true breakthrough discoveries (DSCV), the ratio of the average sample size of published studies (AvgSS_PUB), and the ratio of the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/y687u5n2.

We must also note that the RR policy is not uniformly implemented across participating journals; see Hardwicke and Ioannidis (2018). Each journal has adapted the general RR framework to suit its objectives and the given research field. For example, different journals may choose to implement their RR policies with different significance thresholds (i.e., higher or lower values of α), and different levels for the required a priori power. Consider two recent cases.

First, editors for the journal Nature Human Behaviour have updated their policy and now require all frequentist hypothesis tests in RRs to be powered to at least 95% (Nature Human Behaviour Editorial, 2018). Second, editors for the journal Cortex have decided to lower the α significance threshold for RRs from α = 0.05 to α = 0.02 (Cortex Editorial Board, 2018). The required minimum for a priori power was kept at the previously established level of 90%.

Changing what qualifies as a “significant” finding will have a substantial impact on what sample sizes qualify for sufficient statistical power (i.e., Power = Pr(p-value < α | θ, n)) and will therefore impact all studies seeking publication. Whether the level required for a priori power is raised or the threshold for significance is lowered, the result will be that larger sample sizes will be needed for publication. For example, suppose a researcher wishes to publish a study (a two-sample, two-sided t-test) under a RR policy with α = 0.05. If the a priori effect size estimate is δ = 0.21, and the variance is σ² = 1, then the required sample size for 90% power will be a total of n = 956 observations (478 per sample).
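Sample sizes of this kind can be checked roughly with the familiar normal approximation for a two-sample, two-sided test. The sketch below is not equation 2.1; the function name and the use of the normal distribution in place of the t-distribution are assumptions made for illustration, so the results can fall an observation or so per group short of t-based values.

```python
import math
from statistics import NormalDist


def n_per_group(delta, alpha, power, sigma=1.0):
    """Approximate per-group sample size for a two-sample, two-sided test.

    Uses n = 2 * ((z_{1 - alpha/2} + z_{power}) * sigma / delta)^2, i.e. the
    normal approximation rather than an exact t-test power calculation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Compare the requirement at two significance thresholds, all else equal.
for alpha in (0.05, 0.02):
    n = n_per_group(delta=0.21, alpha=alpha, power=0.90)
    print(f"alpha = {alpha}: about {n} per group, {2 * n} in total")
```

The point of the comparison is the direction and rough size of the change: a stricter threshold at the same power and effect size calls for a noticeably larger study.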
In contrast, if the significance threshold is changed to α = 0.02, all else being equal, the required sample size for 90% power will be n = 1,184 (592 per sample). (These numbers can be calculated as per equation 2.1.)

With regards to the customization of RR policies, we advise that stakeholders should model the impact of such changes prior to their implementation. While requiring researchers to use larger sample sizes for publication will no doubt increase reliability, it may also have the unintended effect of reducing the number of novel discoveries. Worse still, it may further the improper use of power calculations by unintentionally motivating researchers with limited resources to exaggerate their a priori power.

What’s more, greater statistical stringency for RR policies may discourage researchers from choosing to publish their work under the RR scheme. Indeed, until RR is a more widely used publication scheme, researchers may be wary. Already, some have expressed concerns about whether or not a study published under a RR policy will receive a similar number of (highly valued) citations as a study published under a traditional publication policy (Chambers, 2018a).

4.2 A model for Bayesian Registered Reports

A Bayesian RR policy has been implemented in several journals as an option for researchers. For example, the journal Comprehensive Results in Social Psychology (CRSP) specifies the following in its submission guidelines for researchers: “For inference by Bayes factors, authors should guarantee testing participants until the Bayes factor is either more than 3 or less than 0.33 to ensure clear conclusions,” (Jonas and Cesario, 2017).

As another example, the journal BMC Biology specifies the following in its editorial policy: “For inference by Bayes factors, authors must be able to guarantee data collection until the Bayes factor is at least 6 times in favour of the experimental hypothesis over the null hypothesis (or vice versa).
Authors with resource limitations are permitted to specify a maximum feasible sample size at which data collection must cease regardless of the Bayes factor; however to be eligible for advance acceptance this number must be sufficiently large that inconclusive results at this sample size would nevertheless be an important message for the field,” (BMC Biology Editorial Board, 2017).

We previously discussed the BF-RR policy in Section 2.4. As of yet, there have been very few publications under this policy and no large-scale efforts to model the impact of fully adopting a BF-RR policy. One example of a study that has been published under the BF-RR policy is Coltheart et al. (2018), published in the journal Cortex. Unfortunately, this study suffered from errors in execution due to, among other things, “miscommunication” about the BF. Indeed, the study serves to illustrate some of the practical difficulties one may encounter with research under the BF-RR policy. In particular, the authors note that, with small sample sizes, the BF can suffer from a large degree of “instability” (Coltheart et al., 2018).

In this section, the model developed in Chapter 3 will be used to model the publication process under a BF-RR policy. One limitation of our model framework is that we work under the assumption that study sample sizes are fixed in advance. Clearly, policies such as the editorial policies of CRSP and BMC Biology suggest to researchers that data collection be done on a continuous basis, ideally until a satisfactory conclusion (based on the BF) can be reached.

Under our model framework, the BF-RR policy considers a study with a BF01 > BFthres as negative, while a study with a BF01 ≤ 1/BFthres is considered positive (see Section 2.3). A study with 1/BFthres < BF01 ≤ BFthres is therefore considered inconclusive. Our model framework will characterize the BF-RR policy by setting A = B.
This equality, A = B, represents the fact that, under a BF-RR policy, the probability of publication for a negative result is equal to the probability of publication for a positive result. The value of C is zero for BF-RR policies since inconclusive studies are not published. As such, we will disregard the cases of authors “with resource limitations” who could conceivably publish “sufficiently large” inconclusive results as per the BMC Biology policy.

Since the BF-RR policy does not require researchers to show any sample size calculations, it is reasonable to assume that the size of the study does not (directly) impact the probability of publication. Also note that discussing the “power of a study” in this Bayesian context requires additional care. The frequentist α significance threshold is not defined, and therefore the power of a study is not Pr(p-value < α | alternative). However, we still wish to investigate frequentist properties. Therefore, we define power, in this context, as Pr(BF01 < 1/BFthres | alternative). Recall that previously, research strategies (i.e., (psp, pwr) combinations) were restricted within ([0, 1] × [α, 1]). For the BF-RR policy, strategies are restricted to have (psp, pwr) within ([0, 1] × [0, 1]).

The psp will impact the probability of publication, as is generally assumed. Therefore, we have the simple function for probability of publication (as used for policies SP0 and SP1), so that:

$$A = B = (1 - psp)^m; \quad C = 0. \tag{4.7}$$

We also assume that all studies are two-sample tests of normally distributed data and are tested using a JZS-BF testing scheme; for details see Rouder et al. (2009). Finally, we will consider the BF-RR policy with an evidence threshold of BFthres = 3.

The remainder of this section is structured as follows. In Section 4.2.1, we consider how our metrics of interest change with the BF-RR policy compared to the SP0 policy.
In Section 4.2.2, we consider the merits of adopting a BF-RR policy relative to the “more stringent” policies SP1 and SP2 previously considered in Sections 3.2 and 3.3.

4.2.1 The BF-RR policy relative to SP0

In Table 4.6, we tabulate how the various metrics of interest change with the BF-RR policy compared to SP0, the standard policy with α = 0.05 and no sample size requirement. Based on our results, we can draw the following three main conclusions on the impact of adopting a BF-RR policy.

First, adopting a BF-RR policy will result in a substantially higher publication rate. This is due to the fact that under a BF-RR policy both positive and negative results are published. Note however that, for all ecosystems considered, the JCR is extremely low, between 0.02 and 0.11. This suggests that the vast majority of publications under a BF-RR policy would be negative. This is no doubt due to the fact that, for the magnitude of effect sizes we consider, the sample size required to obtain a negative result with high probability is quite low relative to the sample size required to obtain a positive result; see Figures 2.4 and 2.5.

Indeed, consider the following calculation (recall that we are assuming sample sizes are fixed in advance). The total sample size required to have an 80% chance of obtaining a negative result when the true effect size is δ = 0 is n = 100 (with σ² = 1, and with BFthres = 3). In contrast, the total sample size required to have an 80% chance of obtaining a positive result when the true effect size is δ = 0.20 is n = 1,424 (with σ² = 1, and with BFthres = 3).

Second, studies published under a BF-RR policy will be larger (see Table 4.6 “Change in AvgSS_PUB”).
As was noted in Chapter 1, it is known that sample sizes will typically need to be larger with Bayesian testing schemes relative to analogous frequentist testing in situations when there is little prior information incorporated (Zhang et al., 2011).

Finally, due to larger sample sizes under the BF-RR policy, reliability is increased. The OR, the PREL, and the DREL are all higher under the BF-RR policy than under the SP0 policy.

Ratio of ecosystem with a BF-RR policy to ecosystem with the SP0 policy:

δ       k  m   PubR     OR   PREL   DSCV   DREL  AvgSSPUB    JCR
0.20   20  1   7.97   1.32   1.18   0.93   2.70      0.82   0.05
0.20   20  3  11.25   2.48   1.50   0.69   2.72      1.22   0.03
0.20   20  9  13.32   6.41   2.13   0.50   2.76      1.94   0.02
0.20  100  1   6.48   1.12   1.12   1.35   2.74      0.80   0.06
0.20  100  3  10.64   1.75   1.35   0.96   2.76      1.00   0.04
0.20  100  9  14.49   3.88   1.97   0.67   2.82      1.40   0.02
0.20  500  1   5.15   1.06   1.11   1.89   2.85      0.94   0.09
0.20  500  3   9.25   1.44   1.30   1.31   2.88      1.04   0.05
0.20  500  9  14.34   2.69   1.89   0.87   2.98      1.27   0.03
0.43   20  1   5.56   1.22   1.21   1.32   3.94      2.51   0.16
0.43   20  3   9.12   1.82   1.56   0.96   3.98      2.97   0.09
0.43   20  9  12.45   3.91   2.56   0.68   4.13      4.00   0.04
0.43  100  1   4.89   1.15   1.15   1.92   3.46      2.78   0.19
0.43  100  3   8.63   1.50   1.39   1.37   3.51      3.01   0.10
0.43  100  9  13.26   2.75   2.15   0.93   3.67      3.63   0.05
0.43  500  1   4.51   1.13   1.13   2.91   3.31      3.52   0.23
0.43  500  3   8.17   1.39   1.33   2.05   3.37      3.71   0.12
0.43  500  9  13.42   2.32   2.00   1.36   3.55      4.24   0.05

Table 4.6: A comparison between ecosystems with a BF-RR policy (with BF threshold = 3) and ecosystems with the SP0 policy. The table lists estimates for the ratios of the publication rate (PubR), the reliability metrics (OR, PREL, DREL), the number of true breakthrough discoveries (DSCV), the average sample size of published studies (AvgSSPUB), and the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/y5y5cqs4.

4.2.2 The BF-RR policy relative to SP1 and SP2

For the same range of values for m (= 1, 3, 9), k (= 20, 100, 500), and δ (= 0.20, 0.43), we now compare the BF-RR policy to the two policies of “greater statistical stringency,” SP1 and SP2. Tables 4.7 and 4.8 lead to the following observations:

1. In ecosystems where it is more expensive to pursue high-powered studies (i.e., in small δ and/or small k ecosystems), studies published under the BF-RR policy are mostly of smaller sample size (see Tables 4.7 and 4.8, “Change in AvgSSPUB”) relative to studies published under the SP1 policy and SP2 policy.

2. Overall reliability, for most cases, is similar for the BF-RR policy and the more “stringent” policies. The positive reliability (PREL) is also similar for most ecosystems with the BF-RR policy relative to the SP1 and SP2 policies.

3. The publication rate of the BF-RR policy is much higher for all scenarios considered (see Tables 4.7 and 4.8, “Change in PubR”). This fact, combined with the very low journal content ratio (see Tables 4.7 and 4.8, “Change in JCR”), suggests that, relative to the SP1 and SP2 policies, the BF-RR policy publishes a large number of studies, the vast majority of which are negative.

Ratio of ecosystem with a BF-RR policy to ecosystem with the SP1 policy:

δ       k  m   PubR     OR   PREL   DSCV   DREL  AvgSSPUB    JCR
0.20   20  1   7.01   0.84   0.74   7.72   0.36      0.21   0.05
0.20   20  3  15.85   0.96   0.58   4.72   0.35      0.22   0.03
0.20   20  9  41.71   1.19   0.39   2.43   0.34      0.25   0.02
0.20  100  1   6.58   0.86   0.86   7.00   0.50      0.32   0.06
0.20  100  3  14.55   0.96   0.75   4.25   0.50      0.33   0.04
0.20  100  9  37.40   1.12   0.57   2.17   0.48      0.35   0.02
0.20  500  1   5.64   0.90   0.94   6.62   0.70      0.48   0.09
0.20  500  3  12.36   0.97   0.88   4.01   0.70      0.48   0.05
0.20  500  9  31.38   1.08   0.76   2.01   0.68      0.50   0.03
0.43   20  1   5.59   0.94   0.93   6.90   0.72      1.00   0.16
0.43   20  3  12.36   1.00   0.86   4.29   0.71      0.98   0.09
0.43   20  9  31.80   1.13   0.74   2.22   0.70      1.02   0.04
0.43  100  1   5.34   0.97   0.97   6.88   0.84      1.42   0.19
0.43  100  3  11.53   1.01   0.93   4.28   0.84      1.39   0.10
0.43  100  9  29.11   1.09   0.85   2.20   0.83      1.40   0.05
0.43  500  1   4.92   0.99   0.99   7.78   0.95      1.99   0.23
0.43  500  3  10.49   1.02   0.97   4.85   0.95      1.96   0.12
0.43  500  9  26.37   1.08   0.93   2.48   0.94      1.98   0.05

Table 4.7: A comparison between ecosystems with a BF-RR policy (with BF threshold = 3) and ecosystems with the SP1 policy. The table lists estimates for the ratios of the publication rate (PubR), the reliability metrics (OR, PREL, DREL), the number of true breakthrough discoveries (DSCV), the average sample size of published studies (AvgSSPUB), and the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/yxsygcqu.

4.2.3 Concluding thoughts on Bayesian RR

Enthusiasm for Bayesian testing schemes appears to be growing in many social science fields; see Wagenmakers et al. (2018) and Dienes and Mclatchie (2018). However, as noted previously, the BF-RR policy still remains on the fringes and is poorly understood.

As was mentioned earlier, the results of our model are limited since we assumed throughout that all studies are of fixed sample size design. One feature unique to the BF-RR is that it encourages the use of data-dependent sample sizes. Developing a more in-depth model that considers this aspect would certainly be worthwhile and may provide better insights.
It would also be worthwhile to further investigate which BF thresholds are best suited for different magnitudes of effect sizes. We noted that for small effect sizes, the sample sizes required to obtain a positive result can often be considerably larger than those required to obtain a negative result. This asymmetry may result in an abundance of negative results and a very low JCR.

Ratio of ecosystem with a BF-RR policy to ecosystem with the SP2 policy:

δ       k  m   PubR     OR   PREL   DSCV   DREL  AvgSSPUB    JCR
0.20   20  1   5.47   0.92   0.82   2.89   0.68      0.28   0.05
0.20   20  3  10.83   1.20   0.72   1.94   0.68      0.29   0.03
0.20   20  9  20.64   1.92   0.64   1.21   0.67      0.30   0.02
0.20  100  1   5.34   0.95   0.95   2.56   1.14      0.43   0.06
0.20  100  3  10.33   1.21   0.94   1.70   1.14      0.44   0.04
0.20  100  9  18.92   1.91   0.97   1.05   1.14      0.46   0.02
0.20  500  1   4.79   0.99   1.04   2.52   1.74      0.66   0.09
0.20  500  3   9.24   1.22   1.10   1.66   1.75      0.67   0.05
0.20  500  9  16.79   1.87   1.31   1.02   1.79      0.69   0.03
0.43   20  1   4.58   1.04   1.03   2.50   1.64      1.35   0.16
0.43   20  3   8.87   1.26   1.08   1.70   1.65      1.33   0.09
0.43   20  9  16.26   1.93   1.27   1.07   1.68      1.36   0.04
0.43  100  1   4.54   1.07   1.07   2.59   2.09      1.95   0.19
0.43  100  3   8.62   1.26   1.17   1.76   2.11      1.93   0.10
0.43  100  9  15.58   1.89   1.48   1.10   2.18      1.97   0.05
0.43  500  1   4.34   1.09   1.08   3.28   2.40      2.80   0.23
0.43  500  3   8.18   1.26   1.21   2.23   2.43      2.79   0.12
0.43  500  9  14.83   1.85   1.60   1.39   2.53      2.86   0.05

Table 4.8: A comparison between ecosystems with a BF-RR policy (with BF threshold = 3) and ecosystems with the SP2 policy. The table lists estimates for the ratios of the publication rate (PubR), the reliability metrics (OR, PREL, DREL), the number of true breakthrough discoveries (DSCV), the average sample size of published studies (AvgSSPUB), and the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/yyzw2gcw.

In practice, while many journals have adopted the BF-RR policy, in all of these journals it remains optional for researchers.
As is the case for standard RR policies, researchers submitting their research to a participating journal are offered the choice of whether to use the BF-RR scheme or a more traditional format. In addition, not all journals have implemented the BF-RR identically.

Some journals have recently updated the BF threshold required for publication. For example, Nature Human Behaviour now requires all Bayesian tests to show a BF of at least 10 or less than 1/10 (Nature Human Behaviour Editorial, 2018). A model such as the one we developed can be used to better understand how such a change might affect the published literature. Or perhaps, just like the suggestion that researchers (in a frequentist setting) “justify their alpha” depending on the particular circumstances of the study (see Lakens et al. (2018a)), researchers could be required to “justify” their BF thresholds according to the relative costs and benefits of obtaining false positives and false negatives in a given study.

4.3 A model for the Conditional Equivalence Testing policy

In this section we will use our model to investigate the merits of the CET policy we put forth and described in detail in Section 2.4; also see Figure 2.9. The CET policy is suggested as an alternative to a RR publication policy. While we recognize many benefits of the RR policy (including (1) increased reliability and (2) an increased rate of breakthrough discoveries), we see two potential shortcomings.

First, the need for sample size calculations can often be problematic (see the previous discussion on this point in Chapter 1). Second, it may be undesirable to have a majority of published studies be inconclusive (i.e., a very low JCR). Going forward in this section, we are particularly curious as to whether the CET policy will offer similar benefits as the RR policy (reliable results and a relatively high number of discoveries), while not suffering (at least not to the same extent) from these two shortcomings.
As such, note that going forward, all comparisons tabulated (i.e., comparisons in Table 4.9 and in Table 4.14) will be relative to RR.

For our model framework, a CET policy will consider a study with p1 < α1 as positive, while a study with p2 < α2 (and p1 ≥ α1) is considered negative. An inconclusive study has both p1 ≥ α1 and p2 ≥ α2. Power is defined as the probability of obtaining a positive result under the alternative hypothesis (i.e., in our model, Power = Pr(p1 < α1 | μd = δ)). This follows the CET testing scheme detailed in Section 2.1. Previously, research strategies (i.e., (psp, pwr) combinations) were restricted within [0,1] × [α,1]. For the CET policy, strategies are restricted to (psp, pwr) within [0,1] × [α1,1].

In addition, a CET policy will be characterized by setting A = B. The value of the C parameter will be zero since no inconclusive studies are published. The equality A = B represents the fact that, under a CET policy, the probability of publication for a negative result is equal to the probability of publication for a positive result.

Since the CET policy does not necessarily require researchers to show that their target sample size is sufficient to ensure a given level of power, it is reasonable to assume that the pwr does not directly impact the probability of publication. As was done previously, the psp will impact the probability of publication such that:

\[ A = B = (1 - \mathrm{psp})^{m}; \quad C = 0. \tag{4.8} \]

We will also consider a version of the CET policy which does require researchers to show that their target sample size is sufficient to ensure a given level of power (labelled as “CET-SSR”). For a CET-SSR policy, we define:

\[ A = B = \frac{(1 - \mathrm{psp})^{m}}{1 + \exp\left\{ -(\log 19)\, \dfrac{\mathrm{pwr} - c_{50}}{c_{95} - c_{50}} \right\}}, \tag{4.9} \]

where the power, pwr, is the probability of obtaining a positive result when the alternative is true, pwr = Pr(p1 < α1 | μd = δ).

For any CET policy there are many moving parts to consider. One must define specific levels for α1, α2, and Δ.
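To make the definitions above concrete, the following Python sketch classifies a study as positive, negative, or inconclusive, estimates the probability of each outcome by simulation, and evaluates the publication probabilities of Eqs. (4.8) and (4.9). It is an illustration only, not the R code used for the results in this chapter. Several assumptions are ours: the function names are hypothetical, the tests are z-tests with known σ = 1 rather than the t-based tests of Section 2.1, and the default values c50 = 0.5 and c95 = 0.8 are placeholders rather than the values used in the model.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def cet_outcome(p1, p2, alpha1=0.05, alpha2=0.10):
    """Classify a study under CET: NHST first, equivalence test only if NHST fails."""
    if p1 < alpha1:
        return "positive"
    if p2 < alpha2:
        return "negative"
    return "inconclusive"


def outcome_probs(delta, margin, n_total, alpha1=0.05, alpha2=0.10,
                  n_sims=20000, seed=1):
    """Monte Carlo estimate of Pr(positive), Pr(negative), Pr(inconclusive).

    Assumption: known sigma = 1 and z-tests, so the numbers are approximate.
    """
    rng = random.Random(seed)
    se = math.sqrt(4.0 / n_total)  # SE of the mean difference, two equal groups
    counts = {"positive": 0, "negative": 0, "inconclusive": 0}
    for _ in range(n_sims):
        est = rng.gauss(delta, se)
        p1 = 2.0 * (1.0 - STD_NORMAL.cdf(abs(est) / se))
        # TOST: p2 is the larger of the two one-sided p-values
        p_lower = 1.0 - STD_NORMAL.cdf((est + margin) / se)
        p_upper = STD_NORMAL.cdf((est - margin) / se)
        counts[cet_outcome(p1, max(p_lower, p_upper), alpha1, alpha2)] += 1
    return {name: count / n_sims for name, count in counts.items()}


def publication_prob(psp, m, pwr=None, c50=0.5, c95=0.8):
    """A = B from Eq. (4.8) when pwr is None, otherwise from Eq. (4.9).

    The c50 and c95 defaults are placeholders, not values from the thesis.
    """
    novelty = (1.0 - psp) ** m
    if pwr is None:
        return novelty
    ssr = 1.0 / (1.0 + math.exp(-math.log(19.0) * (pwr - c50) / (c95 - c50)))
    return novelty * ssr


if __name__ == "__main__":
    # The 'positive' entry estimates the power when delta is the true effect.
    print(outcome_probs(delta=0.20, margin=0.20, n_total=400))
    print(publication_prob(psp=0.2, m=3, pwr=0.8))
```

In this sketch the sample size requirement enters only through the logistic factor in Eq. (4.9): a study with pwr = c95 keeps 95% of the publication probability it would have under Eq. (4.8), and a study with pwr = c50 keeps half of it.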
The α1 threshold will determine how easy it will be to obtain a positive result, while the α2 threshold and Δ margin will together determine how easy it will be to obtain a negative result. For our modeling exercise, we will fix α2 = 0.10. This specific choice is perhaps rather arbitrary but reflects the belief that a “type E” error is most likely less costly than a type 1 error. We explore a range of values for the equivalence margin, Δ = (0.1, 0.2, 0.3, 0.4, 0.5), and for α1 = (0.005, 0.02, 0.05). As before, we consider two effect sizes, δ = 0.20 and δ = 0.43. For each combination of these parameters we consider the CET policy without SSR (“CET”) and the CET policy with SSR (“CET-SSR”).

We will calculate the probability of obtaining each type of result (i.e., positive, negative and inconclusive), using Monte Carlo integration as detailed in Section 2.2. While we calculate the metrics of interest for 30 different configurations (5 (Δ) × 3 (α1) × 2 (SSR)) of the CET policy (see Figures in Section 4.5), we will focus our interpretation of results on the following four specific configurations:

• CET1: a CET policy with Δ = 0.20, α1 = 0.05, with no SSR;
• CET2: a CET policy with Δ = 0.30, α1 = 0.05, with no SSR;
• CET3: a CET policy with Δ = 0.20, α1 = 0.005, with no SSR; and
• CET4: a CET policy with Δ = 0.20, α1 = 0.05, with a sample size requirement.

The remainder of this section is structured as follows. In Section 4.3.1, we consider how our metrics of interest change with a standard CET policy (CET1) compared to the RR policy we examined in Section 4.1. In Section 4.3.2, we investigate the effects of widening the equivalence margin and in particular consider the operating characteristics of CET2. In Section 4.3.3 we consider the effects of lowering the α1 significance threshold and in particular the CET3 policy.
Finally, in Section 4.3.4 we consider the merits of adopting a CET policy with a sample size requirement, and in particular the CET4 policy.

Note that, for brevity’s sake, we refrain from presenting all potential policy comparisons (e.g., the CET policies are not compared to the SP0 policy and the more stringent SP1 and SP2). However, the comparisons between the RR policy and the SP0, SP1, and SP2 policies (Section 4.1), in combination with the comparisons between the RR policy and the CET policies (Section 4.3), should allow one to infer basic properties about the indirect comparisons.

4.3.1 A standard CET policy

Table 4.9 shows how the metrics of interest change with the CET1 policy relative to the RR policy considered in Section 4.1. Across most of the range of values for δ, k and m, the CET policy shows a lower overall reliability (OR), lower positive reliability (PREL), and lower reliability for discoveries (DREL). This is due to the fact that under the CET policy, sample sizes for published studies are typically smaller (see AvgSSPUB). We do see, however, that there is a substantial increase in the JCR.

The relative merits of this “standard CET policy” compared to the RR policy are highly dependent on δ, k, and m. Indeed, for the ecosystem where δ = 0.43, k = 500, and m = 9, we see a five-fold increase in the JCR, at the cost of modest decreases in the reliability metrics and the DSCV metric.
In contrast, for δ = 0.20, k = 500, and m = 1, we see a 38% increase in JCR and a 23% increase in DSCV with only minimal declines in the AvgSSPUB and reliability metrics.

Ratio of ecosystem with a CET1 policy to ecosystem with the RR policy:

δ       k  m   PubR     OR   PREL   DSCV   DREL  AvgSSPUB    JCR
0.20   20  1   0.69   0.96   0.82   1.60   0.53      0.81   2.06
0.20   20  3   0.69   0.91   0.71   1.66   0.53      0.84   2.38
0.20   20  9   0.71   0.88   0.58   1.71   0.51      0.85   2.86
0.20  100  1   0.83   1.01   0.92   1.39   0.75      0.89   1.64
0.20  100  3   0.84   0.97   0.87   1.43   0.75      0.91   1.74
0.20  100  9   0.87   0.95   0.80   1.47   0.75      0.92   1.88
0.20  500  1   0.91   1.02   0.96   1.23   0.87      0.94   1.38
0.20  500  3   0.92   0.99   0.94   1.26   0.87      0.95   1.42
0.20  500  9   0.93   0.98   0.90   1.28   0.87      0.96   1.48
0.43   20  1   0.25   0.88   0.93   0.63   0.48      0.75   4.66
0.43   20  3   0.17   0.64   0.81   0.82   0.48      0.74   7.03
0.43   20  9   0.13   0.41   0.62   1.09   0.48      0.73  10.22
0.43  100  1   0.28   0.96   1.01   0.50   0.74      0.98   4.19
0.43  100  3   0.19   0.79   0.97   0.66   0.73      1.07   5.93
0.43  100  9   0.15   0.64   0.87   0.89   0.73      1.19   7.70
0.43  500  1   0.33   0.99   1.03   0.50   0.90      1.13   3.67
0.43  500  3   0.23   0.87   1.03   0.65   0.90      1.27   4.79
0.43  500  9   0.20   0.78   0.99   0.88   0.90      1.42   5.54

Table 4.9: A comparison between ecosystems with the CET1 policy (with Δ = 0.20, α1 = 0.05, with no SSR) and ecosystems with the RR policy (see Table 4.2). The table lists estimates for the ratios of the publication rate (PubR), the reliability metrics (OR, PREL, DREL), the number of true breakthrough discoveries (DSCV), the average sample size of published studies (AvgSSPUB), and the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/y4xtsqtt.

In ecosystems where the true effect size is larger (δ = 0.43) and m and k values are small, JCR values are relatively high; see Table 4.9 and also Table 4.10. Negative results are accepted for publication in theory. However, in certain circumstances the equivalence margin may be far too narrow (relative to the true effect size) to actually produce many negative publications in practice.

δ       k  m   PubR     OR   PREL   DSCV   DREL    JCR  AvgSSPUB
0.20   20  1   0.17   0.83   0.67   0.28   0.14   0.40       206
0.20   20  3   0.15   0.82   0.46   0.51   0.13   0.29       213
0.20   20  9   0.14   0.82   0.24   0.88   0.12   0.23       219
0.20  100  1   0.29   0.88   0.75   0.22   0.20   0.33       355
0.20  100  3   0.25   0.89   0.57   0.40   0.20   0.22       367
0.20  100  9   0.23   0.89   0.33   0.70   0.18   0.15       379
0.20  500  1   0.40   0.91   0.79   0.14   0.24   0.29       506
0.20  500  3   0.35   0.91   0.63   0.25   0.24   0.19       517
0.20  500  9   0.32   0.92   0.39   0.42   0.22   0.12       526
0.43   20  1   0.09   0.77   0.76   0.47   0.13   0.94        59
0.43   20  3   0.05   0.58   0.53   1.07   0.13   0.89        56
0.43   20  9   0.04   0.38   0.26   2.41   0.12   0.84        51
0.43  100  1   0.12   0.85   0.83   0.26   0.21   0.89        99
0.43  100  3   0.07   0.73   0.65   0.61   0.20   0.78        98
0.43  100  9   0.05   0.60   0.38   1.40   0.19   0.65        97
0.43  500  1   0.16   0.88   0.86   0.11   0.26   0.82       137
0.43  500  3   0.10   0.81   0.71   0.25   0.26   0.66       140
0.43  500  9   0.08   0.73   0.44   0.58   0.24   0.48       146

Table 4.10: For ecosystems defined by the CET1 policy and varying values of k, m, and δ, the table lists estimates for the metrics of interest. These include the publication rate (PubR), reliability metrics (OR, PREL, DREL), the number of true breakthrough discoveries (DSCV), the average total sample size of published studies (AvgSSPUB), and the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/y3l2sq7q.

Consider for example the case where δ = 0.43, m = 1 and k = 20. For this ecosystem, the JCR is 94%. This is due to the fact that, for these settings, the sample size needed in order to obtain (with a high probability) a negative result greatly exceeds the sample size required to obtain (with a high probability) a positive result.
As a consequence, the published literature under such a policy, in these circumstances, will be similar to what would be obtained under the SP0 policy. (Note that for an even narrower margin with Δ = 0.10, negative publications are virtually non-existent (i.e., we obtain JCR ≈ 1); see Figure 4.14 in Section 4.5.)

In ecosystems where the true effect size is smaller (δ = 0.20), we see much lower JCR values, suggesting that negative results will be published more often. Our main conclusion is that the width of the equivalence margin has a very pronounced impact on the JCR and other metrics of interest. Indeed, the number of negative publications will greatly depend on how Δ compares to m, k and δ; see Figure 4.14.

There are other issues of concern as well. Figure 4.2 plots, for δ = 0.20, the ENP for different values of psp and pwr, under the CET1 policy (i.e., without SSR, for Δ = 0.20, α1 = 0.05, α2 = 0.10). Consider which research strategies (i.e., which (psp, pwr) combinations) are most highly prized (in terms of the ENP) and note the distinct bi-modality for ecosystems with k = 20 and k = 100. For these ecosystems where collecting data is relatively costly, we see two distinct strategies associated with high ENP values.

First, a strategy that has low psp values and high pwr values. This strategy will produce studies with sufficiently narrow confidence intervals so as to potentially qualify as “negative results.” From a reliability perspective, this strategy is desirable, as it will produce both novel, reliable discoveries and published negative results.

The second strategy, one of low psp and low pwr, is one of studies with no chance of being negative results: the sample size is simply too low. This is because the width of the confidence intervals obtained with such sample sizes will almost always be much larger than the width of the equivalence margin (see orange line in Figure 4.2 for reference; see also Figure 2.1).
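The reference point marked by the orange line can be approximated with a short calculation. The Python sketch below is a rough illustration under assumptions of our own: a two-sample design with equal group sizes, known σ = 1, and normal quantiles in place of t quantiles. The function names are hypothetical. It finds the total sample size at which the (1 − 2α2) confidence interval is exactly as wide as the equivalence margin [−Δ, Δ], and the approximate power of the first-stage test at that sample size.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_total_ci_matches_margin(margin, alpha2=0.10, sigma=1.0):
    """Total n at which the (1 - 2*alpha2) CI is as wide as [-margin, margin]."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha2)
    # Solve 2 * z * sigma * sqrt(4 / n) = 2 * margin for n
    return 4.0 * (z * sigma / margin) ** 2


def approx_power(delta, n_total, alpha1=0.05, sigma=1.0):
    """Approximate power of the two-sided z-test at total sample size n_total."""
    se = sigma * math.sqrt(4.0 / n_total)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha1 / 2.0)
    shift = delta / se
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


if __name__ == "__main__":
    n_star = n_total_ci_matches_margin(margin=0.20)
    print(round(n_star), round(approx_power(0.20, n_star), 2))
```

Under this approximation, an equivalence test can only reject when the confidence interval fits inside the margin, so studies with a total sample size below the value returned by n_total_ci_matches_margin cannot produce a negative result.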
For these studies, there are only two possible outcomes: an inconclusive result and a positive result.

The inconclusive results will not be published (recall, C = 0) and the positive results, while published, will most likely be false positives, since they are underpowered and of small psp. Indeed, this second strategy recalls the fear of researchers “gaming the system” (see discussion in Section 2.4.1). From a reliability perspective, this is undesirable and should be discouraged. We see three possible solutions to disincentivize this undesirable strategy:

1. widen the equivalence margin (i.e., increase Δ), to make the publication of negative results (and hence a strategy of increased power) more appealing; or

2. lower the significance threshold for positive results, α1, to make false positive publications more difficult to obtain; or

3. impose a sample size requirement (SSR), to discourage low powered studies.

In the following three sub-sections we will consider each of these possible solutions.

[Figure 4.2 image not reproduced. Panels: k = 20, 100, 500 by m = 1, 3, 9; axes: PSP and PWR; scale: ENP.]

Figure 4.2: The CET1 policy (δ = 0.20): Plots show the expected number of publications, ENP, for different values of psp and pwr under a CET policy with δ = 0.20, Δ = 0.20, α1 = 0.05. The printed numbers correspond to values of ENP at psp = 0.2, 0.5, and 0.8; and pwr = 0.2, 0.5, and 0.8. The orange dashed line indicates the value of pwr corresponding to a sample size for which the expected width of the (1 − 2α2)% CI is equal to the width of the equivalence margin. R-code to reproduce: http://tinyurl.com/yy6bxh2c.

4.3.2 A CET policy with a more generous equivalence margin

We now consider CET2, a CET policy with α1 = 0.05, α2 = 0.10, no SSR, and a wider equivalence margin, with Δ = 0.30. Table 4.11 provides the metrics of interest for the various ecosystems considered.

δ       k  m   PubR     OR   PREL   DSCV   DREL    JCR  AvgSSPUB
0.20   20  1   0.32  0.829   0.66   0.32   0.14   0.22       210
0.20   20  3   0.28  0.865   0.46   0.55   0.14   0.16       214
0.20   20  9   0.26  0.887   0.24   0.94   0.13   0.12       217
0.20  100  1   0.44  0.850   0.72   0.25   0.18   0.20       305
0.20  100  3   0.38  0.888   0.53   0.43   0.17   0.14       307
0.20  100  9   0.35  0.911   0.30   0.72   0.16   0.10       308
0.20  500  1   0.54  0.866   0.77   0.14   0.21   0.20       422
0.20  500  3   0.46  0.902   0.59   0.24   0.21   0.13       422
0.20  500  9   0.42  0.924   0.35   0.41   0.19   0.09       422
0.43   20  1   0.15  0.851   0.76   0.76   0.18   0.59        72
0.43   20  3   0.12  0.810   0.57   1.51   0.18   0.43        75
0.43   20  9   0.11  0.784   0.32   2.88   0.17   0.31        77
0.43  100  1   0.24  0.901   0.82   0.51   0.25   0.50       114
0.43  100  3   0.19  0.881   0.65   0.99   0.25   0.33       119
0.43  100  9   0.18  0.870   0.41   1.86   0.23   0.22       125
0.43  500  1   0.31  0.922   0.84   0.21   0.28   0.44       149
0.43  500  3   0.26  0.908   0.68   0.41   0.28   0.28       155
0.43  500  9   0.24  0.901   0.45   0.77   0.26   0.18       161

Table 4.11: A CET policy with a more generous equivalence margin: For ecosystems defined by the CET2 policy (with Δ = 0.30, α1 = 0.05 and no SSR) and varying values of k, m, and δ, the table lists estimates for the metrics of interest. These include the publication rate (PubR), reliability metrics (OR, PREL, DREL), the number of true breakthrough discoveries (DSCV), the average total sample size of published studies (AvgSSPUB), and the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/y5leb83j.

For ecosystems with larger effect sizes (i.e., where μd = 0.43), we see that the JCR ranges between 0.18 and 0.59.
This represents a substantial difference with what was observed with the narrower equivalence margin (Δ = 0.20); see Table 4.10. With the wider margin we see that a larger proportion of publications are now negative. Also, note that average sample sizes for published studies are somewhat larger. However, the reliability of published positive results (PREL) is left relatively unchanged.

Now consider ecosystems with smaller effect sizes (i.e., where μd = 0.20). Figure 4.3 plots the ENP for different values of PSP and PWR for μd = 0.20, under the CET2 policy. We see that indeed, when Δ is larger, the low-psp, low-pwr strategy is less favourable (i.e., there is no mode in the lower left-hand corner of the plots in Figure 4.3, in comparison to Figure 4.2).

[Figure 4.3 image not reproduced. Panels: k = 20, 100, 500 by m = 1, 3, 9; axes: PSP and PWR; scale: ENP.]

Figure 4.3: The CET2 policy (δ = 0.20): Plots show the number of expected publications, ENP, for different values of PSP and PWR under a CET policy with μd = 0.20, Δ = 0.30, α1 = 0.05. The printed numbers correspond to values of ENP at psp = 0.2, 0.5, and 0.8; and pwr = 0.2, 0.5, and 0.8. The orange dashed line indicates the value of PWR corresponding to a sample size for which the expected width of the (1 − 2α2)% CI is equal to the width of the equivalence margin. R-code to reproduce: http://tinyurl.com/y2qnho5g.

It is worth noting the potential consequences of a CET policy with too narrow a margin, or too wide a margin. For example, with Δ = 0.10 and δ = 0.43, negative publications are virtually non-existent (i.e., JCR ≈ 1); see Figure 4.14. At the other extreme, with Δ = 0.50 and δ = 0.20, average sample sizes of published studies are only half as large (between 39% and 56%) as with Δ = 0.20 (all else being equal); see Figure 4.13 in Section 4.5.
Indeed, if the equivalence margin is too wide (i.e., if Δ is allowed to be too large), the power required to obtain (with high probability) a negative publication will be unsatisfactory and one will obtain an abundance of underpowered (unreliable) publications. Also note that, from the point of view of someone interpreting the result of a study, a margin that is too large renders the distinction between “negative” and “inconclusive” entirely meaningless.

4.3.3 A CET policy with a lower α1 significance threshold

We now consider a CET policy with a more stringent α1 = 0.005 significance threshold. Table 4.12 provides the metrics of interest for the CET3 policy for the various ecosystems considered. For this CET policy, we maintain α2 = 0.10, no SSR, and an equivalence margin with Δ = 0.20.

For ecosystems with larger effect sizes (i.e., where μd = 0.43), we see that overall reliability (OR) and positive reliability (PREL) rates are almost 100% across all values of k and m. The reliability of novel discoveries (DREL) is also very high, with values ranging from 0.75 to 0.81.
This represents a substantial (more than two-fold) increase relative to the numbers we observed for the RR policy in Section 4.1 (compare DREL numbers in Tables 4.12 and 4.4).

δ       k  m   PubR     OR   PREL   DSCV   DREL    JCR  AvgSSPUB
0.20   20  1   0.39   0.96   0.96   0.16   0.66   0.15       597
0.20   20  3   0.35   0.97   0.91   0.26   0.65   0.08       601
0.20   20  9   0.32   0.98   0.79   0.43   0.63   0.04       603
0.20  100  1   0.43   0.96   0.96   0.14   0.68   0.15       659
0.20  100  3   0.38   0.98   0.91   0.24   0.67   0.08       658
0.20  100  9   0.35   0.99   0.81   0.39   0.65   0.04       657
0.20  500  1   0.48   0.96   0.97   0.10   0.72   0.16       786
0.20  500  3   0.42   0.98   0.93   0.17   0.71   0.08       780
0.20  500  9   0.39   0.99   0.83   0.27   0.69   0.04       776
0.43   20  1   0.14   0.99   0.98   0.31   0.77   0.62       176
0.43   20  3   0.10   0.98   0.95   0.70   0.77   0.37       189
0.43   20  9   0.10   0.98   0.88   1.52   0.75   0.17       209
0.43  100  1   0.18   0.99   0.98   0.26   0.80   0.58       214
0.43  100  3   0.14   0.99   0.96   0.57   0.79   0.34       229
0.43  100  9   0.13   0.98   0.89   1.22   0.78   0.15       250
0.43  500  1   0.23   0.99   0.98   0.14   0.81   0.53       257
0.43  500  3   0.18   0.99   0.96   0.30   0.81   0.30       271
0.43  500  9   0.16   0.99   0.90   0.63   0.79   0.13       289

Table 4.12: A CET policy with a lower α1 significance threshold: For ecosystems defined by the CET3 policy (with μd = 0.20, Δ = 0.20, α1 = 0.005, and no SSR) and varying values of k, m, and δ, the table lists estimates for the metrics of interest. These include the publication rate (PubR), reliability metrics (OR, PREL, DREL), the number of true breakthrough discoveries (DSCV), the average total sample size of published studies (AvgSSPUB), and the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/y2c34axz.

Relative to the RR policy, the JCR is also much larger. For ecosystems with δ = 0.43, we see a one-and-a-half to three-fold increase in the JCR for the CET3 policy compared to the RR policy.
Finally, the publication rate (PubR) and the number of novel discoveries (DSCV) are both lower relative to the RR policy.

Now consider ecosystems with smaller effect sizes (i.e., where μd = 0.20). Figure 4.4 plots the ENP for different values of PSP and PWR for μd = 0.20, under the CET3 policy. All of the reliability metrics (OR, PREL, DREL) are very high; however, the JCR and DSCV numbers are both substantially lower than what was calculated for both the CET1 policy (the “standard CET policy”) and the RR policy.

Figures 4.7, 4.8, 4.9, 4.10, 4.11, 4.12, 4.13, 4.14 in Section 4.5 plot the metrics of interest for CET policies with three different values of α1 = (0.005, 0.02, 0.05). We note how the impact of α1 on various metrics is highly dependent on the width of the equivalence margin and the magnitude of the true effect size. For example, see the top-left panel of Figure 4.13, and note that the average sample size of published studies sharply declines with increasing Δ. The lower row of panels of Figure 4.13 shows a very different dynamic when α1 = 0.05.

[Figure 4.4 image not reproduced. Panels: k = 20, 100, 500 by m = 1, 3, 9; axes: PSP and PWR; scale: ENP.]

Figure 4.4: The CET3 policy (δ = 0.20): Plots show the number of expected publications, ENP, for different values of PSP and PWR under a CET policy with μd = 0.20, Δ = 0.20, α1 = 0.005. The printed numbers correspond to values of ENP at psp = 0.2, 0.5, and 0.8; and pwr = 0.2, 0.5, and 0.8. The orange dashed line indicates the value of PWR corresponding to a sample size for which the expected width of the (1 − 2α2)% CI is equal to the width of the equivalence margin. R-code to reproduce: http://tinyurl.com/y2alfdpr.

4.3.4 A CET policy with a SSR

δ       k  m   PubR     OR   PREL   DSCV   DREL    JCR  AvgSSPUB
0.20   20  1   0.33   0.92   0.82   0.20   0.28   0.27       538
0.20   20  3   0.29   0.93   0.67   0.36   0.27   0.17       544
0.20   20  9   0.26   0.93   0.43   0.61   0.25   0.11       550
0.20  100  1   0.37   0.92   0.82   0.18   0.28   0.27       585
0.20  100  3   0.32   0.93   0.67   0.32   0.27   0.17       590
0.20  100  9   0.29   0.93   0.44   0.55   0.25   0.11       593
0.20  500  1   0.43   0.93   0.83   0.12   0.29   0.27       666
0.20  500  3   0.37   0.93   0.68   0.22   0.28   0.16       669
0.20  500  9   0.33   0.94   0.45   0.37   0.26   0.10       671
0.43   20  1   0.12   0.90   0.88   0.33   0.30   0.85       128
0.43   20  3   0.07   0.82   0.75   0.77   0.30   0.70       131
0.43   20  9   0.05   0.75   0.50   1.82   0.28   0.50       134
0.43  100  1   0.15   0.91   0.88   0.25   0.31   0.82       152
0.43  100  3   0.09   0.84   0.75   0.58   0.31   0.64       156
0.43  100  9   0.07   0.78   0.51   1.36   0.29   0.44       164
0.43  500  1   0.18   0.91   0.89   0.12   0.33   0.76       179
0.43  500  3   0.12   0.86   0.76   0.27   0.32   0.57       186
0.43  500  9   0.10   0.82   0.52   0.63   0.30   0.37       197

Table 4.13: A CET policy with a sample size requirement: For ecosystems defined by the CET4 policy (with δ = 0.20, Δ = 0.20, α1 = 0.05, and with a SSR) and varying values of k, m, and δ, the table lists estimates for the metrics of interest. These include the publication rate (PubR), reliability metrics (OR, PREL, DREL), the number of true breakthrough discoveries (DSCV), the average total sample size of published studies (AvgSSPUB), and the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/y3vaogl8.

[Figure 4.5 image not reproduced. Panels: k = 20, 100, 500 by m = 1, 3, 9; axes: PSP and PWR; scale: ENP.]

Figure 4.5: The CET4 policy (δ = 0.20): Plots show the number of expected publications, ENP, for different values of PSP and PWR under a CET policy with δ = 0.20, Δ = 0.10, α1 = 0.05 and a SSR. The printed numbers correspond to values of ENP at psp = 0.2, 0.5, and 0.8; and pwr = 0.2, 0.5, and 0.8. The orange dashed line indicates the value of PWR corresponding to a sample size for which the expected width of the (1 − 2α2)% CI is equal to the width of the equivalence margin. R-code to reproduce: http://tinyurl.com/y4a3hmts.

Ratio of ecosystem with CET-SSR policy to ecosystem with RR policy:

δ       k  m   PubR     OR   PREL   DSCV   DREL  AvgSSPUB    JCR
0.20   20  1   1.32   1.06   1.02   1.16   1.06      1.21   1.37
0.20   20  3   1.34   1.02   1.03   1.17   1.06      1.21   1.36
0.20   20  9   1.36   1.00   1.05   1.17   1.07      1.22   1.33
0.20  100  1   1.05   1.05   1.01   1.14   1.05      1.18   1.33
0.20  100  3   1.07   1.02   1.02   1.14   1.05      1.18   1.31
0.20  100  9   1.08   1.00   1.04   1.15   1.05      1.19   1.29
0.20  500  1   0.96   1.05   1.01   1.12   1.04      1.15   1.25
0.20  500  3   0.97   1.02   1.02   1.12   1.04      1.15   1.24
0.20  500  9   0.98   1.00   1.03   1.12   1.04      1.15   1.23
0.43   20  1   0.34   1.03   1.08   0.44   1.13      1.37   4.21
0.43   20  3   0.23   0.90   1.14   0.60   1.13      1.53   5.51
0.43   20  9   0.19   0.80   1.18   0.83   1.14      1.73   6.14
0.43  100  1   0.34   1.02   1.07   0.47   1.12      1.35   3.84
0.43  100  3   0.24   0.92   1.12   0.63   1.13      1.51   4.88
0.43  100  9   0.21   0.84   1.17   0.87   1.13      1.68   5.27
0.43  500  1   0.37   1.02   1.07   0.54   1.11      1.35   3.40
0.43  500  3   0.28   0.93   1.11   0.71   1.12      1.50   4.13
0.43  500  9   0.26   0.87   1.15   0.96   1.12      1.64   4.30

Table 4.14: The CET-SSR compared to the RR policy: A comparison between ecosystems with the CET4 policy (with δ = 0.20, Δ = 0.20, α1 = 0.05, and with a SSR) and ecosystems with the RR policy (see Table 4.2). The table lists estimates for the ratios of the publication rate (PubR), the reliability metrics (OR, PREL, DREL), the number of true breakthrough discoveries (DSCV), the average sample size of published studies (AvgSSPUB), and the journal content ratio (JCR). R-code to reproduce: http://tinyurl.com/y2bdu6lu.

We now consider the CET4 policy. Table 4.13 provides the metrics of interest for the various ecosystems considered. This CET4 policy has a sample size requirement, maintains α1 = 0.05 and α2 = 0.10 significance thresholds, and defines the equivalence margin with Δ = 0.20. The sample size requirement we consider will be the same as that considered in policy SP2.
Studies are required to commit to a sample size that suggests a priori 80% power.

Figure 4.5 plots the ENP for different values of PSP and PWR for d = 0.20, under the CET4 policy. We see that the research strategies associated with the highest ENP values are those of high pwr and low psp. Indeed, most of the beneficial strategies have a pwr much higher than the minimum pwr required to obtain a negative result.

For ecosystems with both smaller and larger effect sizes (i.e., for both d = 0.20 and d = 0.43), we see that published studies have sample sizes much larger than with the CET1 policy (but somewhat smaller than with the CET3 policy). The JCR values are smaller than those for the CET1 policy (but much larger than those recorded for the CET3 policy).

Note that when the equivalence margin is narrow (i.e., when Δ is small) relative to the magnitude of the true effect size (e.g., for Δ ≤ 0.20 and d = 0.43), there is a potential for a relatively low DSCV, particularly in circumstances when novelty is less important for publication (i.e., in ecosystems where m is small); see Table 4.14 and Figure 4.17.

Importantly, we see that when the CET policy incorporates a SSR, the consequences of defining “too wide” an equivalence margin are less far-reaching. Indeed, as Δ increases, a CET-SSR policy simply reduces to the RR policy (since we lose any meaningful distinction between “negative” and “inconclusive”). This is not the case for the CET policy without a sample size requirement.

On the other hand, if the margin is “too narrow,” the CET-SSR policy simply reduces to the SP2 policy (since, in practice, no null findings are ever published). Once again, this is not the case for the standard CET policy without a SSR.
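As a minimal sketch of the decision rule behind these two limiting cases, the following Python code classifies a single study as positive, negative, or inconclusive. It is an independent illustration and not the thesis's R code: it assumes a normally distributed effect estimate with a known standard error and a symmetric margin [−Δ, Δ], and the function name cet_decision and its arguments are hypothetical.

```python
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def cet_decision(estimate, se, margin, alpha1=0.05, alpha2=0.10):
    """Classify one study as 'positive', 'negative', or 'inconclusive'.

    Hypothetical helper: assumes a normal estimate with known standard
    error `se` and a symmetric equivalence margin [-margin, +margin].
    """
    # Stage 1: two-sided NHST of H0: effect = 0 at level alpha1.
    p_nhst = 2 * (1 - _Z.cdf(abs(estimate) / se))
    if p_nhst < alpha1:
        return "positive"

    # Stage 2: two one-sided tests (TOST), each at level alpha2.
    p_lower = 1 - _Z.cdf((estimate + margin) / se)  # H0: effect <= -margin
    p_upper = _Z.cdf((estimate - margin) / se)      # H0: effect >= +margin
    if max(p_lower, p_upper) < alpha2:
        return "negative"
    return "inconclusive"


# The same non-significant estimate under three margin widths.
for margin in (0.01, 0.20, 5.0):
    print(margin, cet_decision(estimate=0.05, se=0.10, margin=margin))
```

Under these assumptions, a very wide margin labels every non-significant result as negative, while a very narrow margin labels none of them as negative, which mirrors the two limits described above.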
A CET policy without a SSR, as the margin shrinks, will instead reduce to the SP0 policy. We therefore conclude that adding a sample size requirement to the CET policy mitigates the risk of setting an inappropriate equivalence margin.

4.3.5 Concluding thoughts on the CET policy

As previously mentioned, we were particularly curious as to whether the CET policy could offer benefits similar to those of the RR policy (relatively high OR and relatively high DSCV), while not needing to include a potentially problematic sample size requirement, nor suffering (at least not to the same extent) from a low JCR.

The short answer to this question is that, for some particular circumstances, with an appropriate equivalence margin, and suitable values for α1 and α2, yes. Figures 4.7–4.20 plot a number of metrics of interest for all the different configurations we considered. We see that for certain configurations of the CET policy, researchers are indeed incentivized to pursue large novel studies (i.e., high pwr, low psp studies) while the literature remains relevant and is not a “dreaded dumping ground” full of null results (i.e., is not of high JCR values).

However, our results also suggest that the operating characteristics of the CET policy are highly sensitive to the chosen width of the equivalence margin. This is particularly true when there is no sample size requirement. When a sample size requirement is included in the CET policy (i.e., the CET-SSR policy), we see that the choice of Δ has less of an impact on many of the metrics of interest; see Figures 4.7–4.14 in Section 4.5. Furthermore, we observed that the CET-SSR policy fits into the void between the SP2 policy (equivalent to the CET-SSR policy with Δ = 0) and the RR policy (equivalent to the CET-SSR policy with Δ = ∞); see Figure 4.6.

We can also think of the CET policy as a modified RR policy with insurance against inaccurate power calculations.
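To make the notion of an inaccurate power calculation concrete, the following Python sketch computes the per-group sample size that would be planned for 80% power under an optimistic effect size, and the power actually achieved if the true effect is smaller. It is an illustration only and not the thesis's R code: it assumes a two-sided, two-sample z-test on a standardized effect, and the helper names n_per_group and achieved_power are hypothetical. The effect sizes 0.43 and 0.20 are borrowed from the two ecosystems considered in this chapter.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample z-test (normal approximation)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def achieved_power(d, n, alpha=0.05):
    """Power of the same test when the true standardized effect is d."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n / 2)  # expected value of the z statistic
    return _Z.cdf(shift - z_alpha) + _Z.cdf(-shift - z_alpha)


# Sample size planned under an optimistic effect size of 0.43 ...
planned_n = n_per_group(d=0.43)
# ... and the power actually achieved if the true effect is only 0.20.
true_power = achieved_power(d=0.20, n=planned_n)
print(planned_n, round(true_power, 2))
```

Under these assumptions, a study planned for 80% power with roughly 85 participants per group would have only about a one-in-four chance of reaching significance if the true effect were 0.20, so the nominal sample size requirement would be satisfied on paper while the study remained badly underpowered.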
Indeed, to publish more reliable findings, it is in a journal’s best interest to publish high-powered research. However, with a standard RR policy, while researchers may be compelled to satisfy certain sample size requirements, it is often in their best interest to be overly optimistic in their power calculations (particularly if data are expensive). With a suitable CET policy, it is in both the journal’s and the researcher’s best interest to conduct large high-powered studies. As such, under a CET-SSR policy, “the optimal research strategy for individual scientists [better] aligns with the optimal conditions for the advancement of knowledge” (Higginson and Munafò, 2016).

In many ways, a CET-SSR policy is similar to a RR policy with a protection against making erroneous assumptions in a priori power calculations. Journal editors should be pleased to endorse a policy which protects against the “dreaded dumping ground.” And researchers should have nothing to fear. If their power calculations are accurate, researchers can fully expect that a CET-SSR’s conditional in-principle acceptance will deliver a publication.

4.4 Conclusion

In this chapter we used our simple model to consider three alternative publication policies: the RR policy, the RR-BF policy and the CET policy. While our modeling was limited in many ways, we were nevertheless able to gain valuable insights with regards to how these policies, if fully implemented, could influence the incentives driving research. In practice, we anticipate that implementing any

Figure 4.6: The CET and CET-SSR policies are highly sensitive to the chosen width of the equivalence margin. We observed that the CET-SSR policy fits into the void between the SP2 policy (equivalent to the CET-SSR policy with Δ = 0) and the RR policy (equivalent to the CET-SSR policy with Δ = ∞).
In contrast, the CET policy fits into the void between the SP0 policy and an “everything is published” policy.

alternative publication policy will require a delicate balance between incentivizing researchers and encouraging journal editors.

In a similar exercise, albeit one derived using an econometrics framework, Frankel and Kasy (2018) model the impacts of various rules and criteria for publication and conclude that there is: “a benefit of publishing extreme results[...] [b]ut [...] also find a new benefit of publishing precise results, even precise null results that don’t change the current action.” These types of modeling exercises are important for better understanding, and for establishing the desirability of, alternative publication policies prior to their implementation. In order to avoid any unintended consequences of a given policy change, we should carefully consider a policy’s impacts with regards to both researchers and journals.

We also wish to note that once a novel policy is implemented, continuous auditing of how the new policy works in practice is required. For example, Spezi et al. (2018) investigate how soundness-only peer review (SOPR) policies, once implemented, are maintained in “open-access mega-journals.” As another example, most recently, Goldacre et al. (2019) investigated five high-impact journals, all of which endorse the Consolidated Standards of Reporting Trials (CONSORT) guidelines. Goldacre et al.
(2019) conclude that, while these journals all have publication policies which explicitly support CONSORT, in practice, all exhibit extensive breaches of their advertised policies.

4.5 Chapter 4 - Additional figures

Code to reproduce all results, tables and figures in this chapter: Please note that all code to produce the results, tables and figures in this chapter has been posted to a repository on the Open Science Framework, DOI 10.17605/OSF.IO/YQCVA.

[Figure 4.7 plot: PWR.PUB.MED against Δ (0.1 to 0.5). Panel columns: no SSR with µd = 0.2, no SSR with µd = 0.43, SSR with µd = 0.2, SSR with µd = 0.43. Panel rows: α1 = 0.005, 0.02, 0.05. Legend: k = 20, 100, 500; m = 1, 3, 9.]

Figure 4.7: Values of MpwrPUB for CET policy with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/yxf9tuge.

[Figure 4.8 plot: DSCV against Δ; same panel layout and legend as Figure 4.7.]

Figure 4.8: Values of DSCV for CET policy with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/yyczjm45.

[Figure 4.9 plot: PUBRATE against Δ; same panel layout and legend as Figure 4.7.]

Figure 4.9: Values of PubR for CET policy with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns).
R-code to reproduce: http://tinyurl.com/yycreeuf.

[Figure 4.10 plot: OR against Δ (0.1 to 0.5). Panel columns: no SSR with µd = 0.2, no SSR with µd = 0.43, SSR with µd = 0.2, SSR with µd = 0.43. Panel rows: α1 = 0.005, 0.02, 0.05. Legend: k = 20, 100, 500; m = 1, 3, 9.]

Figure 4.10: Values of OR for CET policy with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/y3nwxdjn.

[Figure 4.11 plot: PREL against Δ; same panel layout and legend as Figure 4.10.]

Figure 4.11: Values of PREL for CET policy with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/y3l2sq7q.

[Figure 4.12 plot: DREL against Δ; same panel layout and legend as Figure 4.10.]

Figure 4.12: Values of DREL for CET policy with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/y5leb83j.

[Figure 4.13 plot: avgSampleSize.PUB against Δ; same panel layout and legend as Figure 4.10.]

Figure 4.13: Values of avgSSPUB for CET policy with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns).
R-code to reproduce: http://tinyurl.com/y36tzkgl.

[Figure 4.14 plot: JCR against Δ (0.1 to 0.5). Panel columns: no SSR with µd = 0.2, no SSR with µd = 0.43, SSR with µd = 0.2, SSR with µd = 0.43. Panel rows: α1 = 0.005, 0.02, 0.05. Legend: k = 20, 100, 500; m = 1, 3, 9.]

Figure 4.14: Values of JCR for CET policy with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/y4uho96n.

[Figure 4.15 plot: Ratio of OR (CET vs. RR) against Δ; same panel layout and legend as Figure 4.14.]

Figure 4.15: A comparison between CET ecosystems and ecosystems with the RR policy (see Table 4.2). The figure plots estimates for the ratios of the OR with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/y4pl27mx.

[Figure 4.16 plot: Ratio of PREL (CET vs. RR) against Δ; same panel layout and legend as Figure 4.14.]

Figure 4.16: A comparison between CET ecosystems and ecosystems with the RR policy (see Table 4.2). The figure plots estimates for the ratios of the PREL with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/y44hwnkd.

[Figure 4.17 plot: Ratio of DSCV (CET vs. RR) against Δ; same panel layout and legend as Figure 4.14.]

Figure 4.17: A comparison between CET ecosystems and ecosystems with the RR policy (see Table 4.2).
The figure plots estimates for the ratios of the DSCV with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/y56vt8jf.

[Figure 4.18 plot: Ratio of DREL (CET vs. RR) against Δ (0.1 to 0.5). Panel columns: no SSR with µd = 0.2, no SSR with µd = 0.43, SSR with µd = 0.2, SSR with µd = 0.43. Panel rows: α1 = 0.005, 0.02, 0.05. Legend: k = 20, 100, 500; m = 1, 3, 9.]

Figure 4.18: A comparison between CET ecosystems and ecosystems with the RR policy (see Table 4.2). The figure plots estimates for the ratios of the DREL with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/yy6j3tvo.

[Figure 4.19 plot: Ratio of JCR (CET vs. RR) against Δ; same panel layout and legend as Figure 4.18.]

Figure 4.19: A comparison between CET ecosystems and ecosystems with the RR policy (see Table 4.2). The figure plots estimates for the ratios of the JCR with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns). R-code to reproduce: http://tinyurl.com/y5nryauh.

[Figure 4.20 plot: Ratio of avgSampleSize.PUB (CET vs. RR) against Δ; same panel layout and legend as Figure 4.18.]

Figure 4.20: A comparison between CET ecosystems and ecosystems with the RR policy (see Table 4.2). The figure plots estimates for the ratios of the avgSSPUB with varying values of k, m, Δ, and α1 (= 0.005, 0.02, 0.05) (rows); results for d = 0.2 and d = 0.43, with and without SSR (columns).
R-code to reproduce: http://tinyurl.com/yyp72aqa.

Chapter 5

Conclusion

Back when we were reading Karl Popper’s Logic of Scientific Discovery and Thomas Kuhn’s Structure of Scientific Revolutions, who would’ve thought that we’d be living through a scientific revolution ourselves? When it comes to technical difficulty and sheer importance of the scientific theories being disputed, this recent replication revolution is by far more trivial than the earlier revolutions in biology and physics. Still, within its narrow parameters, a revolution it is.
The competing narratives of scientific revolution, Andrew Gelman (2018)

Statistics allow us to better understand the world around us by helping us to distinguish signal from noise. As such, statistics are fundamental to the advancement of science. However, signal and noise do not exist in a void. Data only exist if they are observed and recorded, and statistics only come to be when they are calculated from these data.

As it turns out, the way in which statistics are obtained greatly impacts their behaviour. Failure to take this into account has resulted in the unfortunate situation in which science has identified far too many signals in situations where, instead, we should have recognized that we were only looking at noise. In other words, our ability to distinguish signal from noise has been compromised because we have failed to appreciate the extent to which statistics are the product of their very creation.

When we are presented with a statistic, it is of very little value to us unless we know how this statistic came to be in front of us. Therefore, in order to add value to statistics, we should not only work to improve their interpretation, but also strive to shape the context of their creation. Developing innovative publication policy is one way to advance this goal.

Publication bias has been recognized as a serious problem for several decades now (Rosenthal, 1979).
Yet, in many fields, it is only getting worse (Franco et al., 2014, Pautasso, 2010). An investigation by Kühberger et al. (2014) concludes that the “entire field of psychology” is now tainted by “pervasive publication bias.” Moreover, statistical power in many fields continues to be very low (Lamberink et al., 2018).

There remains substantial disagreement on the merits of requiring greater statistical stringency. The debate over pre-registration and result-blind peer-review is also far from over (e.g., Coffman and Niederle (2015), de Winter and Happee (2013), van Assen et al. (2014)). Indeed, registration prior to data collection may be problematic for observational studies. Pre-registration publication policies in fields such as epidemiology and political science may not work as well. However, all should agree that innovative publication policy prescriptions can and should be part of the solution to the “reproducibility crisis.”

We conclude with two main recommendations. First, the treatment of null (or negative) findings must change and the scientific community must offer incentives for their publication; see Nature Human Behaviour Editorial (2019). We recommend that journal editors consider implementing a CET-SSR type policy to do just this. The RR policy and CET policy are also worth considering. Second, before adopting any (radical) policy changes (Brembs, 2019), we should take a moment to carefully consider, and model, how these proposed changes might impact outcomes. The type of model and methodology we considered in Chapters 3 and 4 can be used to do just this.

While some call for dropping p-values and strict thresholds of evidence altogether, we believe that it is not worthwhile to fight “the temptation to discretize continuous evidence and to declare victory” (Gelman and Carlin, 2017). Instead, the research community should embrace this “temptation” and work to leverage it in order to achieve desirable outcomes.
Indeed, one way to address the “practical difficulties that reviewers face with null results” (Findley et al., 2016) is to further discretize continuous evidence by means of equivalence testing, and we submit that CET can be an effective tool for distinguishing those “high-quality null results” (Shields et al., 2009) worthy of publication.

Recently, a number of influential researchers have argued that to address low reliability, scientists, reviewers and regulators should “abandon statistical significance” (McShane et al., 2017). In contrast, in some fields, such as reinforcement learning, the currently proposed solution is just the opposite: the adoption of “significance metrics and tighter standardization of experimental reporting” (Henderson et al., 2018). A recent comment piece in Nature (accompanied by hundreds of supporting signatures) (Amrhein et al., 2019) called for scientists to “retire statistical significance.” While we appreciate these efforts by statisticians to improve the interpretation (and reliability) of science, we share John Ioannidis’ concern: “Removing statistical significance entirely will allow anyone to make any overstated claim about any result being important. It will also facilitate claiming that that there are no conflicts between studies when conflicts do exist” (Gelman, 2019); see also Ioannidis (2019). In particular, we wish to emphasize Ioannidis’ following remarks: “Banning statistical significance while retaining P values (or confidence intervals) will not improve numeracy and may foster statistical confusion and create problematic issues with study interpretation, a state of statistical anarchy. [...] Without clear rules for the analyses, science and policy may rely less on data and evidence and more on subjective opinions and interpretations,” (Ioannidis, 2019).

We recognize that current publication policies, in which evidence is dichotomized, may be highly unsatisfactory (Amrhein et al., 2017).
However, one benefit of adopting distinct categories based on clearly defined thresholds (as in the CET policy) is that one can assess, in a systematic way, the state of published research. While using a “more holistic view of the evidence” (McShane et al., 2017) to inform publication decisions may or may not prove effective, “meta-research” under such a paradigm is clearly less feasible. As such, the question of effectiveness may never be adequately answered.

Bayesian approaches offer many benefits. However, we see two main drawbacks. First, adopting Bayesian testing requires a substantial paradigm shift. Since the interpretation of findings deemed significant by traditional NHST may differ with Bayes, some will no doubt be reluctant to accept the shift. With CET, the traditional usage and interpretation of the p-value remains unchanged, except in circumstances when one fails to reject the null. As such, CET does not change the interpretation of findings already established as significant. Indeed, CET simply “extend[s] the arsenal of confirmatory methods rooted in the frequentist paradigm of inference” (Wellek, 2017). Second, as we observed in our simulation study in Chapter 2, Bayesian testing is potentially less powerful than NHST (in the traditional frequentist sense with standard Bayesian thresholds of evidence) and as such could require substantially larger sample sizes.

We concluded Chapter 2 with the thought that the CET publication policy should be welcomed by journal editors, researchers and all those who wish to see more reliable science. However, we emphasize that any policy will inevitably yield different results depending on the discipline. Indeed, each scientific field is unique and implementing policy reforms is made all the more difficult by the “balkanization of research methods and statistical traditions across scientific subdisciplines” (Goodman, 2019).
The models presented in Chapters 3 and 4 offer one way to better determine which policies will work best for a given research environment. Further modelling, tailored to the idiosyncrasies of a given (sub-)field, would be required to provide more concrete guidance.

5.1 Future research

There are many potential areas for further research. With regards to CET, determining whether Δ and/or α1 and/or α2 should be chosen with consideration of the sample size is important and not trivial; related work includes Pérez and Pericchi (2014), who put forward a “Bayes/non-Bayes compromise” in which the α-level of a confidence interval changes with n.

Issues which have proven problematic for standard equivalence testing must also be addressed for CET. These include multiplicity control (Lauzon and Caffo, 2009) and potential problems with interpretation; see Aberegg et al. (2017), and most recently Campbell and Gustafson (2018b). The impact of a CET type policy on meta-analysis should also be examined (Hedges, 1992), i.e., how should one account for the exclusion of inconclusive results in the published literature when deriving estimates in a meta-analysis? However, note that the CET policy we proposed does mandate that a public record is kept by the journal for every unpublished inconclusive study; see Section 2.4.1. This feature should alleviate the CET policy’s impact on subsequent meta-analysis and research synthesis.

Finally, Bloomfield et al.
(2018) consider how different publication policies will lead to “changes [in] the nature and timing of authors’ investment in their work.” It remains to be seen to what extent the CET-based policy, in which the decision to publish is based both on the up-front study registration and on the results obtained after analysis, will lead to a “shift from follow-up investment to up-front investment.”

5.2 Limitations of the optimality model

The model considered in Chapters 3 and 4 assumes above all that researchers’ decisions are driven exclusively by the desire to publish. But the situation is more complex. Publication is not necessarily the end goal for a scientific study, and requirements with regards to significance and power are not only encountered at the publication stage. In the planning stages, before a study even begins, ethics committees and granting agencies will often have certain minimal requirements; see Ploutz-Snyder et al. (2014) and Halpern et al. (2002). After a study is published, regulatory bodies and policy makers will also often subject the results to a different set of norms. Furthermore, once a study is published, with or without publication bias, other biases, such as “citation bias,” will work to distort scientific interpretations; see de Vries et al. (2018).

We also assumed a framework of independent Bernoulli trials, i.e., we assumed that each study was an independent test of a true or a false hypothesis. Of course, in practice, most studies are not conducted independently. For example, in the context of clinical trials, phase 3 studies are conducted based on the success or failure of earlier phase 2 studies; see Burt et al. (2017), who model the advantages and disadvantages of different clinical development strategies. It is difficult to anticipate the consequences of failing to incorporate such dependencies in our model.
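The independence assumption can be illustrated with a small Python sketch, given here under stated assumptions and not taken from the thesis's R code. Each study tests a hypothesis that is true with probability psp (treated here as a pre-study probability), is declared significant with probability pwr if the hypothesis is true and with probability alpha otherwise, and is unaffected by every other study. The function names are hypothetical, and the proportion computed is a generic illustration, not one of the reliability metrics defined in Chapters 3 and 4.

```python
import random


def prob_true_given_positive(psp, pwr, alpha=0.05):
    """Closed-form share of significant findings that test a true hypothesis."""
    true_pos = psp * pwr
    false_pos = (1 - psp) * alpha
    return true_pos / (true_pos + false_pos)


def simulate_positive_findings(psp, pwr, alpha=0.05, n_studies=100_000, seed=1):
    """Monte Carlo version: every study is an independent Bernoulli trial."""
    rng = random.Random(seed)
    true_pos = false_pos = 0
    for _ in range(n_studies):
        hypothesis_true = rng.random() < psp
        significant = rng.random() < (pwr if hypothesis_true else alpha)
        if significant and hypothesis_true:
            true_pos += 1
        elif significant:
            false_pos += 1
    return true_pos / (true_pos + false_pos)


exact = prob_true_given_positive(psp=0.2, pwr=0.8)
simulated = simulate_positive_findings(psp=0.2, pwr=0.8)
print(round(exact, 3), round(simulated, 3))
```

Because no study's psp depends on the outcome of any earlier study, the simulated proportion settles near the closed-form value. A model with dependencies would replace the fixed psp with one that is updated after each result.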
A more elaborate model, in which studies are correlated and the psp of subsequent studies is updated appropriately, could prove very informative.

Finally, it is important to acknowledge that no publication policy will be perfect. Science is inherently challenging and we must always be willing to accept that a certain proportion of research is potentially false (Contopoulos-Ioannidis et al., 2003, Djulbegovic and Hozo, 2007). Each policy will have its advantages and disadvantages. Our modelling exercise makes this all the more evident and forces us to carefully consider different potential trade-offs.

Bibliography

S. K. Aberegg, A. M. Hersh, and M. H. Samore. Empirical consequences of current recommendations for the design and interpretation of noninferiority trials. Journal of General Internal Medicine, 2017. ISSN 1525-1497. → page 139

C. Albers and D. Lakens. When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Journal of Experimental Social Psychology, 74:187–195, 2018. → page 4

C. P. Allen and D. M. Mehler. Open science challenges, benefits and tips in early career and beyond. PsyArXiv, 2018. doi:https://doi.org/10.31234/osf.io/3czyt. → pages 3, 88

D. G. Altman and J. M. Bland. Statistics notes: Absence of evidence is not evidence of absence. The BMJ, 311(7003):485, 1995. → page 2

V. Amrhein, F. Korner-Nievergelt, and T. Roth. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ, 5:e3544, 2017. → pages 11, 38, 60, 138

V. Amrhein, S. Greenland, and B. McShane. Scientists rise up against statistical significance. Nature, 567:305–307, 2019. doi:10.1080/00031305.2018.1555101. → page 138

O. L. O. Astivia, A. Gadermann, and M. Guhn. The relationship between statistical power and predictor distribution in multilevel logistic regression: a simulation-based approach. BMC Medical Research Methodology, 19(1):97, 2019. → page 3

L. C. S. Aycaguer and P. A. Galbán.
Explicación del tamaño muestral empleado: una exigencia irracional de las revistas biomédicas. Gaceta Sanitaria, 27(1):53–57, 2013. → pages 3, 38

P. Bacchetti. Peer review of statistics in medical research: the other problem. The BMJ, 324(7348):1271–1273, 2002. → pages 3, 38

M. Bakker, A. van Dijk, and J. M. Wicherts. The rules of the game called psychological science. Perspectives on Psychological Science, 7(6):543–554, 2012. → pages 27, 42

L. Barker, H. Rolka, D. Rolka, and C. Brown. Equivalence testing for binomial random variables: which test to use? The American Statistician, 55(4):279–287, 2001. → page 13

P. L. Bedard, M. K. Krzyzanowska, M. Pintilie, and I. F. Tannock. Statistical power of negative randomized controlled trials presented at American Society for Clinical Oncology annual meetings. Journal of Clinical Oncology, 25(23):3482–3487, 2007. → page 12

C. G. Begley and L. M. Ellis. Drug development: Raise standards for preclinical cancer research. Nature, 483(7391):531–533, 2012. → page 1

C. G. Begley and J. P. Ioannidis. Reproducibility in science: improving the standard for basic and preclinical research. Circulation Research, 116(1):116–126, 2015. → page 49

D. J. Benjamin, J. O. Berger, M. Johannesson, B. A. Nosek, E.-J. Wagenmakers, R. Berk, K. A. Bollen, B. Brembs, L. Brown, C. Camerer, et al. Redefine statistical significance. Nature Human Behaviour, 2(1):6, 2018. → page 38

J. O. Berger and M. Delampady. Testing precise hypotheses. Statistical Science, 2(3):317–335, 1987. → page 20

R. L. Berger, J. C. Hsu, et al. Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science, 11(4):283–319, 1996. → page 9

A. Berry. Subgroup analyses. Biometrics, 46(4):1227–1230, 1990. → page 22

J. M. Bland. The tyranny of power: is there a better way to calculate sample size? The BMJ, 339:b3985, 2009. ISSN 0959-8138. → pages 4, 56

R. Bloomfield, K. Rennekamp, and B. Steenhoven.
No system is perfect: Understanding how registration-based editorial processes affect reproducibility and investment in research quality. Journal of Accounting Research, 56(2):313–362, 2018. → page 140

J. D. Blume, L. D. McGowan, W. D. Dupont, and R. A. Greevy Jr. Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses. PLoS ONE, 13(3):e0188299, 2018. → page 11

J. D. Blume, R. A. Greevy, V. F. Welty, J. R. Smith, and W. D. Dupont. An introduction to second-generation p-values. The American Statistician, 73(sup1):157–167, 2019. → page 11

BMC Biology Editorial Board. Registered reports. https://bmcbiol.biomedcentral.com/about/registered-reports, 2017. → pages 5, 97

G. F. Borm, M. den Heijer, and G. A. Zielhuis. Publication bias was not a good reason to discourage trials with low power. Journal of Clinical Epidemiology, 62(1):47–53, 2009. → pages 3, 39

T. Bosker, J. F. Mudge, and K. R. Munkittrick. Statistical reporting deficiencies in environmental toxicology. Environmental Toxicology and Chemistry, 32(8):1737–1739, 2013. → page 38

B. Brembs. Reliable novelty: New should not trump true. PLoS Biology, 17(2):e3000117, 2019. → page 137

B. Brembs, K. Button, and M. Munafò. Deep impact: unintended consequences of journal rank. Frontiers in Human Neuroscience, 7:291, 2013. → page 42

T. Burt, K. Button, H. Thom, R. Noveck, and M. R. Munafò. The burden of the false-negatives in clinical development: Analyses of current and alternative scenarios and corrective measures. Clinical and Translational Science, 10(6):470–479, 2017. → page 140

K. S. Button, J. P. Ioannidis, C. Mokrysz, B. A. Nosek, J. Flint, E. S. Robinson, and M. R. Munafò. Empirical evidence for low reproducibility indicates low pre-study odds. Nature Reviews Neuroscience, 14(12):877–877, 2013a. → pages 39, 42

K. S. Button, J. P. Ioannidis, C. Mokrysz, B. A. Nosek, J. Flint, E. S. Robinson, and M. R. Munafò. Power failure: why small sample size undermines the reliability of neuroscience.
Nature Reviews Neuroscience, 14(5):365–376, 2013b. → pages 27, 42, 50

C. F. Camerer, A. Dreber, E. Forsell, T.-H. Ho, J. Huber, M. Johannesson, M. Kirchler, J. Almenberg, A. Altmejd, T. Chan, et al. Evaluating replicability of laboratory experiments in economics. Science, 351(6280):1433–1436, 2016. → page 1

H. Campbell and P. Gustafson. Conditional equivalence testing: An alternative remedy for publication bias. PLoS ONE, 13(4):e0195145, 2018a. → page 27

H. Campbell and P. Gustafson. What to make of non-inferiority and equivalence testing with a post-specified margin? arXiv preprint arXiv:1807.03413, 2018b. → pages 13, 139

H. Campbell and P. Gustafson. The world of research has gone berserk: Modeling the consequences of requiring “greater statistical stringency” for scientific publication. The American Statistician, 73(sup1):358–373, 2019. doi:10.1080/00031305.2018.1555101. URL https://doi.org/10.1080/00031305.2018.1555101. → page 62

I. Chalmers and R. Matthews. What are the implications of optimism bias in clinical research? The Lancet, 367(9509):449–450, 2006. → page 4

C. Chambers. personal communication via email (2017-02-24), 2017. → page 4

C. Chambers. “latest citation analysis of registered reports”. Twitter, @chrisdc77, 2018a. URL https://twitter.com/chrisdc77/status/1028987745782386690. → page 96

C. Chambers. “replying to @drkatemills : A sadly common view...”. Twitter, @chrisdc77, 2018b. URL https://twitter.com/chrisdc77/status/1072197688828051456. → page 88

C. D. Chambers, E. Feredoes, S. D. Muthukumaraswamy, and P. Etchells. Instead of “playing the game” it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond. AIMS Neuroscience, 1(1):4–17, 2014. → pages 3, 22

C. D. Chambers, Z. Dienes, R. D. McIntosh, P. Rotshtein, and K. Willmes. Registered reports: realigning incentives in scientific publishing. Cortex, 66:A1–A2, 2015. → page 3

A.-W. Chan, A. Hróbjartsson, K. J. Jørgensen, P. C. Gøtzsche, and D. G.
Altman.Discrepancies in sample size calculations and data analyses reported inrandomised trials: comparison of publications with protocols. The BMJ, 337:a2299, 2008. ! pages 4, 23145P. Charles, B. Giraudeau, A. Dechartres, G. Baron, and P. Ravaud. Reporting ofsample size calculation in randomised controlled trials. The BMJ, 338:b1732,2009. ! page 38B. G. Charlton and P. Andras. How should we rate research?: Counting number ofpublications may be best research performance measure. The BMJ, 332(7551):1214–1215, 2006. ! pages 26, 46D. Chavalarias, J. D. Wallach, A. H. T. Li, and J. P. Ioannidis. Evolution ofreporting p-values in the biomedical literature, 1990-2015. The Journal of theAmerican Medical Association, 315(11):1141–1148, 2016. ! page 1J. J. Chen, Y. Tsong, and S.-H. Kang. Tests for equivalence or noninferioritybetween two proportions. Drug Information Journal, 34(2):569–578, 2000. !page 9S. Choi. Interim analyses and early termination of clinical trials. Journal ofBiopharmaceutical Statistics, 7(4):533–543, 1997. ! page 24P. J. Chuard, M. Vrtı´lek, M. L. Head, and M. D. Jennions. Evidence thatnonsignificant results are sometimes preferred: Reverse p-hacking or selectivereporting? PLoS Biology, 17(1):e3000127, 2019. ! page 27L. C. Coffman and M. Niederle. Pre-analysis plans have limited upside, especiallywhere replications are feasible. Journal of Economic Perspectives, 29(3):81–98, 2015. ! page 137B. A. Cohen. How should novelty be valued in science? eLife, 6, 2017. ! page42J. Cohen. The statistical power of abnormal-social psychological research: areview. The Journal of Abnormal and Social Psychology, 65(3):145–153, 1962.! pages 27, 41M. Coltheart, R. Cox, P. Sowman, H. Morgan, A. Barnier, R. Langdon,E. Connaughton, L. Teichmann, N. Williams, and V. Polito. Belief, delusion,hypnosis, and the right dorsolateral prefrontal cortex: A transcranial magneticstimulation study. Cortex, 101:234–248, 2018. ! page 97D. G. Contopoulos-Ioannidis, E. Ntzani, and J. 
Ioannidis. Translation of highlypromising basic science research into clinical applications. The AmericanJournal of Medicine, 114(6):477–484, 2003. ! page 141146Cortex Editorial Board. Cortex Editorial Policy, 2018. URLhttps://goo.gl/abzBCp. ! page 95G. Cumming. Replication and p-intervals: p-values predict the future onlyvaguely, but confidence intervals do much better. Perspectives onPsychological Science, 3(4):286–300, 2008. ! page 10G. Cumming. The new statistics: Why and how. Psychological Science, 25(1):7–29, 2014. ! page 1G. T. da Silva, B. R. Logan, and J. P. Klein. Methods for equivalence andnoninferiority testing. Biology of Blood and Marrow Transplantation, 15(1):120–127, 2009. ! pages 9, 13, 14O. Dannenberg, H. Dette, and A. Munk. An extension of welch’s approximatet-solution to comparative bioequivalence trials. Biometrika, 81(1):91–101,1994. ! page 9P. Dasgupta and E. Maskin. The simple economics of research portfolios. TheEconomic Journal, 97(387):581–595, 1987. ! page 40Y. de Vries, A. Roest, P. de Jonge, P. Cuijpers, M. Munafo`, and J. Bastiaansen.The cumulative effect of reporting and citation biases on the apparent efficacyof treatments: the case of depression. Psychological Medicine, page 1, 2018.! page 140J. de Winter and R. Happee. Why selective publication of statistically significantresults can be effective. PLoS ONE, 8(6):e66463, 2013. ! page 137K. Dickersin, Y.-I. Min, and C. L. Meinert. Factors influencing publication ofresearch results: follow-up of applications submitted to two institutional reviewboards. The Journal of the American Medical Association, 267(3):374–378,1992. ! page 2Z. Dienes. Using Bayes to get the most out of non-significant results. Frontiers inPsychology, 5:781, 2014. ! page 17Z. Dienes. How Bayes factors change scientific practice. Journal of MathematicalPsychology, 72:78–89, 2016. ! page 4Z. Dienes and N. Mclatchie. Four reasons to prefer Bayesian analyses oversignificance testing. 
Psychonomic Bulletin & Review, pages 1–12, 2017. !pages 4, 11, 17147Z. Dienes and N. Mclatchie. Four reasons to prefer Bayesian analyses oversignificance testing. Psychonomic Bulletin & Review, 25(1):207–218, 2018. !page 101U. Dirnagl et al. Fighting publication bias: introducing the negative resultssection. Journal of Cerebral Blood Flow and Metabolism, 30(7):1263–1264,2010. ! page 22P. M. Dixon and J. H. Pechmann. A statistical test to show negligible trend.Ecology, 86(7):1751–1756, 2005. ! pages 9, 12B. Djulbegovic and I. Hozo. When should potentially false research findings beconsidered acceptable? PLoS Medcine, 4(2):e26, 2007. ! page 141B. Djulbegovic, A. Kumar, A. Magazin, A. T. Schroen, H. Soares, I. Hozo,M. Clarke, D. Sargent, and M. J. Schell. Optimism bias leads to inconclusiveresults - an empirical study. Journal of Clinical Epidemiology, 64(6):583–593,2011. ! pages 4, 12P. Doshi, K. Dickersin, D. Healy, S. S. Vedula, and T. Jefferson. Restoringinvisible and abandoned trials: a call for people to publish the findings. TheBMJ, 346:f2865, 2013. ! page 2E. Dumas-Mallet, K. S. Button, T. Boraud, F. Gonon, and M. R. Munafo`. Lowstatistical power in biomedical science: a review of three human researchdomains. Royal Society Open Science, 4(2):160254, 2017. ! pages 39, 50K. Dwan, D. G. Altman, J. A. Arnaiz, J. Bloom, A.-W. Chan, E. Cronin,E. Decullier, P. J. Easterbrook, E. Von Elm, C. Gamble, et al. Systematicreview of the empirical evidence of study publication bias and outcomereporting bias. PLoS ONE, 3(8):e3081, 2008. ! page 2A. Etz and J. Vandekerckhove. A Bayesian perspective on the reproducibilityproject: Psychology. PLoS ONE, 11(2):e0149794, 2016. ! page 4D. Fanelli. How many scientists fabricate and falsify research? a systematicreview and meta-analysis of survey data. PLoS ONE, 4(5):e5738, 2009. !page 2D. Fanelli. Negative results are disappearing from most disciplines and countries.Scientometrics, 90(3):891–904, 2011. ! page 41F. Fidler, M. A. 
Burgman, G. Cumming, R. Buttrose, and N. Thomason. Impactof criticism of null-hypothesis significance testing on statistical reporting148practices in conservation biology. Conservation Biology, 20(5):1539–1544,2006. ! page 38K. Fiedler. What constitutes strong psychological science? the (neglected) role ofdiagnosticity and a priori theorizing. Perspectives on Psychological Science, 12(1):46–61, 2017. ! page 43M. G. Findley, N. M. Jensen, E. J. Malesky, and T. B. Pepinsky. Can results-freereview reduce publication bias? the results and implications of a pilot study.Comparative Political Studies, 49(13):1667–1703, 2016. ! pages 3, 138M. Fire and C. Guestrin. Over-optimization of academic publishing metrics:Observing goodhart’s law in action. arXiv preprint arXiv:1809.07841, 2018.! page 46R. C. Fraley and S. Vazire. The n-pact factor: Evaluating the quality of empiricaljournals with respect to sample size and statistical power. PLoS ONE, 9(10):e109019, 2014. ! page 42A. Franco, N. Malhotra, and G. Simonovits. Publication bias in the socialsciences: Unlocking the file drawer. Science, 345(6203):1502–1505, 2014. !pages 2, 137A. Frankel and M. Kasy. Which findings should be published? HarvardUniversity working paper, 2018. ! page 120W. E. Frankenhuis and D. Nettle. Open science is liberating and can fostercreativity. Perspectives on Psychological Science, 13(4):439–447, 2018. !page 87L. P. Freedman, I. M. Cockburn, and T. S. Simcoe. The economics ofreproducibility in preclinical research. PLoS Biology, 13(6):e1002165, 2015.! page 2A. Fritz, T. Scherndl, and A. Ku¨hberger. A comprehensive review of reportingpractices in psychological journals: Are effect sizes really enough? Theory &Psychology, 23(1):98–122, 2013. ! page 38T. Gall, J. Ioannidis, and Z. Maniadis. The credibility crisis in research: Caneconomics tools help? PLoS Biology, 15(4):e2001846, 2017. ! page 40M. J. Gardner and D. G. Altman. 
Confidence intervals rather than p-values:estimation rather than hypothesis testing. The BMJ (Clin Res Ed), 292(6522):746–750, 1986. ! page 10149J. Gaudart, L. Huiart, P. J. Milligan, R. Thiebaut, and R. Giorgi. Reproducibilityissues in science, is p value really the only answer? Proceedings of theNational Academy Sciences of the United States of America, 111:E1934, 2014.! pages 37, 38A. Gelman. Commentary: p-values and statistical practice. Epidemiology, 24(1):69–72, 2013a. ! page 2A. Gelman. Preregistration of studies and mock reports. Political Analysis, 21(1):40–41, 2013b. ! page 23A. Gelman. “retire statistical significance”: The discussion. 20 March 2019:https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion/, 2019. ! page 138A. Gelman and J. Carlin. Beyond power calculations assessing type s (sign) andtype m (magnitude) errors. Perspectives on Psychological Science, 9(6):641–651, 2014. ! pages 49, 51A. Gelman and J. Carlin. Some natural solutions to the p-value communicationproblem and why they won’t work. Journal of the American StatisticalAssociation, 112(519):899–901, 2017. ! page 137A. Gelman and E. Loken. The garden of forking paths: Why multiplecomparisons can be a problem, even when there is no “fishing expedition” or“p-hacking” and the research hypothesis was posited ahead of time.http://www.stat.columbia.edu/ gelman/research/unpublished/p hacking.pdf,2013. Unpublished, Accessed on 2019-01-01. ! page 22W. M. Gervais, J. A. Jewell, M. B. Najle, and B. K. Ng. A powerful nudge?presenting calculable consequences of underpowered research shifts incentivestoward adequately powered designs. Social Psychological and PersonalityScience, 6(7):847–854, 2015. ! pages 27, 42J. J. Goeman, A. Solari, and T. Stijnen. Three-sided hypothesis testing:Simultaneous testing of superiority, equivalence and inferiority. Statistics inMedicine, 29(20):2117–2125, 2010. ! page 7B. Goldacre, H. Drysdale, A. Dale, I. Milosevic, E. 
Slade, P. Hartley, C. Marston,A. Powell-Smith, C. Heneghan, and K. R. Mahtani. Compare: a prospectivecohort study correcting and monitoring 58 misreported trials in real time.Trials, 20(1):118, 2019. ! page 121150M. Go¨nen. The Bayesian t-test and beyond. Statistical Methods in MolecularBiology, pages 179–199, 2010. ! page 17S. Goodman and S. Greenland. Assessing the unreliability of the medicalliterature: a response to ‘Why most published research findings are false?’.Johns Hopkins University, Department of Biostatistics; Working Papers, 2007.! page 1S. N. Goodman. Why is getting rid of p-values so hard? musings on science andstatistics. The American Statistician, 73(sup1):26–30, 2019. ! page 139C. Gornitzki, A. Larsson, and B. Fadeel. Freewheelin’scientists: citing Bob Dylanin the biomedical literature. The BMJ, 351, 2015. ! page 40J. A. Grand, S. G. Rogelberg, G. C. Banks, R. S. Landis, and S. Tonidandel. Fromoutcome to process focus: Fostering a more robust psychological sciencethrough registered reports and results-blind reviewing. Perspectives onPsychological Science, 13(4):448–456, 2018. ! page 93C. J. Greene, L. A. Morland, V. L. Durkalski, and B. C. Frueh. Noninferiority andequivalence designs: issues and implications for mental health research.Journal of Traumatic Stress, 21(5):433–439, 2008. ! page 7S. Greenland. Nonsignificance plus high power does not imply support for the nullover the alternative. Annals of Epidemiology, 22(5):364–368, 2012. ! page 2S. Greenland and C. Poole. Living with p-values: resurrecting a Bayesianperspective on frequentist statistics. Epidemiology, 24(1):62–68, 2013. ! page20A. G. Greenwald. Consequences of prejudice against the null hypothesis.Psychological Bulletin, 82(1):1–20, 1975. ! pages 2, 40W. Greve, A. Bro¨der, and E. Erdfelder. Result-blind peer reviews and editorialdecisions: A missing pillar of scientific culture. European Psychologist, 18(4):286–294, 2013. ! pages 2, 3, 88G. Guyatt, R. Jaeschke, N. Heddle, D. 
Cook, H. Shannon, and S. Walter. Basicstatistics for clinicians: 2. interpreting study results: confidence intervals.CMAJ: Canadian Medical Association Journal, 152(2):169–173, 1995. !page 10L. L. Haak, M. Fenner, L. Paglione, E. Pentz, and H. Ratner. Orcid: a system touniquely identify researchers. Learned Publishing, 25(4):259–264, 2012. !page 25151K. Hagen. Novel or reproducible: That is the question. Glycobiology, 26(5):429–429, 2016. ! page 43S. D. Halpern, J. H. Karlawish, and J. A. Berlin. The continuing unethicalconduct of underpowered clinical trials. The Journal of the American MedicalAssociation, 288(3):358–362, 2002. ! page 140T. E. Hardwicke and J. P. A. Ioannidis. Mapping the universe of registeredreports. Nature Human Behaviour, 2(11):793–796, 2018.doi:10.1038/s41562-018-0444-y. URLhttps://doi.org/10.1038/s41562-018-0444-y. ! page 95J. Hartung, J. E. Cottrell, and J. P. Giffin. Absence of evidence is not evidence ofabsence. Anesthesiology, 58(3):298–299, 1983. ! page 2W. W. Hauck and S. Anderson. A proposal for interpreting and reporting negativestudies. Statistics in Medicine, 5(3):203–209, 1986. ! pages 7, 11, 12, 13D. Hauschke, V. Steinijans, and E. Diletti. A distribution-free procedure for thestatistical analysis of bioequivalence studies. International Journal of ClinicalPharmacology, Therapy, and Toxicology, 28(2):72–78, 1990. ! page 9A. Hazra and N. Gogtay. Biostatistics series module 5: Determining sample size.Indian Journal of Dermatology, 61(5):496, 2016. ! pages 3, 41L. V. Hedges. Modeling publication selection effects in meta-analysis. StatisticalScience, pages 246–255, 1992. ! page 139P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deepreinforcement learning that matters. In Thirty-Second AAAI Conference onArtificial Intelligence, 2018. ! page 138I. Hernandez, W. F. Gellad, and C. B. Good. Lowering the p-value threshold -reply. The Journal of the American Medical Association, 320(9):937–938,2018. ! page 38A. D. 
Higginson and M. R. Munafo`. Current incentives for scientists lead tounderpowered studies with erroneous conclusions. PLoS Biology, 14(11):e2000995, 2016. ! pages 27, 39, 40, 42, 43, 45, 119R. Hoekstra, R. Monden, D. van Ravenzwaaij, and E.-J. Wagenmakers. Bayesianreanalysis of null results reported in the New England Journal of Medicine:Strong yet variable evidence for the absence of treatment effects. Manuscriptsubmitted for publication, 2017. ! page 18152J. M. Hoenig and D. M. Heisey. The abuse of power: the pervasive fallacy ofpower calculations for data analysis. The American Statistician, 55(1):19–24,2001. ! page 2M. A. Hofmann. Null hypothesis significance testing in simulation. InProceedings of the 2016 Winter Simulation Conference, pages 522–533. IEEEPress, 2016. ! page 1R. Hubbard and M. J. Bayarri. Confusion over measures of evidence (p’s) versuserrors (a’s) in classical statistical testing. The American Statistician, 57(3):171–178, 2003. ! page 60M. Huic´, M. Marusˇic´, and A. Marusˇic´. Completeness and changes in registereddata and reporting bias of randomized controlled trials in ICMJE journals aftertrial registration policy. PLoS ONE, 6(9):e25258, 2011. ! page 23H. Hung, S.-J. Wang, and R. O’Neill. A regulatory perspective on choice ofmargin and statistical inference issue in non-inferiority trials. BiometricalJournal, 47(1):28–36, 2005. ! page 12J. IntHout, J. P. Ioannidis, and G. F. Borm. Obtaining evidence by a singlewell-powered trial or several modestly powered trials. Statistical Methods inMedical Research, 25(2):538–552, 2016. ! page 39J. P. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):e124, 2005. ! pages 1, 41J. P. Ioannidis. Journals should publish all null results and should sparinglypublish ‘positive’ results. Cancer Epidemiology Biomarkers & Prevention, 15(1):186–186, 2006. ! page 22J. P. Ioannidis. Why most discovered true associations are inflated. Epidemiology,pages 640–648, 2008. ! page 48J. P. 
Ioannidis. The importance of predefined rules and prespecified statisticalanalyses: Do not abandon significance. Journal of the American MedicalAssociation, 321(21):2067–2068, 2019. ! page 138J. P. Ioannidis, I. Hozo, and B. Djulbegovic. Optimal type I and type II error pairswhen the available sample size is fixed. Journal of Clinical Epidemiology, 66(8):903–910, 2013. ! pages 55, 61C. Jennison and B. W. Turnbull. Group Sequential Methods with Applications toClinical Trials. CRC Press, 1999. ! page 20153B. Jia and H. S. Lynn. A sample size planning approach that considers bothstatistical significance and clinical significance. Trials, 16(1):213, 2015. !page 17M. R. Jiroutek, K. E. Muller, L. L. Kupper, and P. W. Stewart. A new method forchoosing sample size for confidence interval–based inferences. Biometrics, 59(3):580–590, 2003. ! page 17V. E. Johnson. Revised standards for statistical evidence. Proceedings of theNational Academy of Sciences, 110(48):19313–19317, 2013. ! page 37V. E. Johnson. Reply to Gelman, Gaudart, Pericchi: More reasons to revisestandards for statistical evidence. Proceedings of the National Academy ofSciences, 111(19):E1936–E1937, 2014. ! page 38V. E. Johnson, R. D. Payne, T. Wang, A. Asher, and S. Mandal. On thereproducibility of psychological science. Journal of the American StatisticalAssociation, 112(517):1–10, 2017. ! page 50K. J. Jonas and J. Cesario. Submission guidelines for authors, comprehensiveresults in social psychology.www.tandf.co.uk/journals/authors/rrsp-submission-guidelines.pdf, 2017. !pages 5, 96Journal of Cognition Editorial. BMC Biology - Registered Reports.https://www.journalofcognition.org/about/registered-reports/, 2018. ! page 5R. M. Kaplan and V. L. Irvin. Likelihood of null effects of large NHLBI clinicaltrials has increased over time. PLoS ONE, 10(8):e0132382, 2015. ! page 88S. Kaul and G. A. Diamond. Good enough: a primer on the analysis andinterpretation of noninferiority trials. 
Annals of Internal Medicine, 145(1):62–69, 2006. ! page 7R. S. Keefe, H. C. Kraemer, R. S. Epstein, E. Frank, G. Haynes, T. P. Laughren,J. Mcnulty, S. D. Reed, J. Sanchez, and A. C. Leon. Defining a clinicallymeaningful effect for the design and interpretation of randomized controlledtrials. Innovations in Clinical Neuroscience, 10(5-6 Suppl A):4S, 2013. !page 12N. L. Kerr. Harking: Hypothesizing after the results are known. Personality andSocial Psychology Review, 2(3):196–217, 1998. ! page 22154J. Kimmelman, J. S. Mogil, and U. Dirnagl. Distinguishing between exploratoryand confirmatory preclinical research will improve translation. PLoS Biology,12(5):e1001863, 2014. ! pages 48, 66E. A. Konijn, R. van de Schoot, S. D. Winter, and C. J. Ferguson. Possiblesolution to publication bias through Bayesian statistics, including proper nullhypothesis testing. Communication Methods and Measures, 9(4):280–302,2015. ! page 17T. Koyama and P. H. Westfall. Decision-theoretic views on simultaneous testingof superiority and noninferiority. Journal of Biopharmaceutical Statistics, 15(6):943–955, 2005. ! page 14J. K. Kruschke and T. M. Liddell. The Bayesian new statistics: Hypothesistesting, estimation, meta-analysis, and power analysis from a Bayesianperspective. Psychonomic Bulletin & Review, pages 1–29, 2017. ! page 4A. Ku¨hberger, A. Fritz, and T. Scherndl. Publication bias in psychology: adiagnosis based on the correlation between effect size and sample size. PLoSONE, 9(9):e105825, 2014. ! page 137D. Lakens. Performing high-powered studies efficiently with sequential analyses.European Journal of Social Psychology, 44(7):701–710, 2014. ! page 20D. Lakens. Equivalence tests: A practical primer for t tests, correlations, andmeta-analyses. Social Psychological and Personality Science, 8(4):355–362,2017. ! pages 6, 12D. Lakens and M. Delacre. Equivalence testing and the second generationp-value. 2018. ! page 11D. Lakens, F. G. Adolfi, C. J. Albers, F. Anvari, M. A. Apps, S. E. 
Argamon,T. Baguley, R. B. Becker, S. D. Benning, D. E. Bradford, et al. Justify youralpha. Nature Human Behaviour, 2(3):168, 2018a. ! pages 61, 103D. Lakens, A. M. Scheel, and P. M. Isager. Equivalence testing for psychologicalresearch: A tutorial. Advances in Methods and Practices in PsychologicalScience, 1(2):259–269, 2018b. ! pages 8, 12H. J. Lamberink, W. M. Otte, M. R. Sinke, D. Lakens, P. P. Glasziou, J. K.Tijdink, and C. H. Vinkers. Statistical power of clinical trials increased whileeffect size remained stable: an empirical analysis of 136,212 clinical trialsbetween 1975 and 2014. Journal of Clinical Epidemiology, 102:123–128,2018. ! pages 41, 44, 50, 137155D. M. Lane and W. P. Dunlap. Estimating effect size: Bias resulting from thesignificance criterion in editorial decisions. British Journal of Mathematicaland Statistical Psychology, 31(2):107–112, 1978. ! page 49T. L. Lash. The harm done to reproducibility by the culture of null hypothesissignificance testing. American Journal of Epidemiology, 186(6):627–635,2017. ! page 1C. Lauzon and B. Caffo. Easy multiplicity control in equivalence testing usingtwo one-sided tests. The American Statistician, 63(2):147–154, 2009. ! page139D. Lawlor. Quality in epidemiological research: should we be submitting papersbefore we have the results and submitting more hypothesis-generatingresearch? International Journal of Epidemiology, 36(5):940, 2007. ! page 22J. S. Lawton. Reproducibility and replicability of science and thoracic surgery.The Journal of Thoracic and Cardiovascular Surgery, 152(6):1489–1491,2016. ! page 85C. J. Lee and C. D. Schunn. Social biases and solutions for procedural objectivity.Hypatia, 26(2):352–373, 2011. ! page 50J. T. Leek and L. R. Jager. Is most published research really false? Annual Reviewof Statistics and Its Application, 4:109–122, 2017. ! page 1R. V. Lenth. Some practical guidelines for effective sample size determination.The American Statistician, 55(3):187–193, 2001. ! page 41M. J. 
Lew. To P or not to P: On the evidential nature of P-values and their place inscientific inference. arXiv preprint arXiv:1311.0081, 2013. ! pages 2, 29Y. Littner, F. B. Mimouni, S. Dollberg, and D. Mandel. Negative results andimpact factor: a lesson from neonatology. Archives of Pediatrics & AdolescentMedicine, 159(11):1036–1037, 2005. ! page 85M. C. Makel, J. A. Plucker, and B. Hegarty. Replications in psychology research:How often do they really occur? Perspectives on Psychological Science, 7(6):537–542, 2012. ! page 42M. Marsman and E.-J. Wagenmakers. Three insights from a Bayesianinterpretation of the one-sided p-value. Educational and PsychologicalMeasurement, 77(3):529–539, 2017. ! page 20156G. Martin and R. M. Clarke. Are psychology journals anti-replication? a snapshotof editorial practices. Frontiers in Psychology, 8:523, 2017. ! page 42S. Mathieu, I. Boutron, D. Moher, D. G. Altman, and P. Ravaud. Comparison ofregistered and published primary outcomes in randomized controlled trials. TheJournal of the American Medical Association, 302(9):977–984, 2009. ! page23S. Mathieu, A.-W. Chan, and P. Ravaud. Use of trial register information duringthe peer review process. PLoS ONE, 8(4):e59910, 2013. ! page 23J. N. Matthews. Small clinical trials: are they all bad? Statistics in Medicine, 14(2):115–126, 1995. ! page 3A. McLaughlin. In pursuit of resistance: pragmatic recommendations for doingscience within one’s means. European Journal for Philosophy of Science, 1(3):353–371, 2011. ! page 40D. McNamee, A. James, and S. Kleinert. Protocol review at the lancet. TheLancet, 372(9634):189–190, 2008. ! page 87B. B. McShane, D. Gal, A. Gelman, C. Robert, and J. L. Tackett. Abandonstatistical significance. arXiv preprint arXiv:1709.07588, 2017. ! page 138L. K. Mell and A. L. Zietman. Introducing prospective manuscript review toaddress publication bias. International Journal of Radiation Oncology -BiologyPhysics, 90(4):729–732, 2014. ! page 22M. Meyners. 
Least equivalent allowable differences in equivalence testing. FoodQuality and Preference, 18(3):541–547, 2007. ! page 12M. Meyners. Equivalence tests–a review. Food Quality and Preference, 26(2):231–245, 2012. ! pages 8, 13J. Miller and R. Ulrich. Optimizing research payoff. Perspectives onPsychological Science, 11(5):664–691, 2016. ! page 40R. Moonesinghe, M. J. Khoury, and A. C. J. Janssens. Most published researchfindings are false - but a little replication goes a long way. PLoS Medicine, 4(2):e28, 2007. ! page 42J. F. Mudge, L. F. Baker, C. B. Edge, and J. E. Houlahan. Setting an optimal athat minimizes errors in null hypothesis significance tests. PLoS ONE, 7(2):e32734, 2012. ! page 61157J. Mulder and E.-J. Wagenmakers. Editors’ introduction to the special issue‘Bayes factors for testing hypotheses in psychological research: Practicalrelevance and new developments’. Journal of Mathematical Psychology, 72:1–5, 2016. ! page 17Nature Human Behaviour Editorial. What next for registered reports. NatureHuman Behaviour, 2(11):789–790, 2018. doi:10.1038/s41562-018-0477-2.URL https://doi.org/10.1038/s41562-018-0477-2. ! pages 87, 95, 102Nature Human Behaviour Editorial. The importance of no evidence. NatureHuman Behaviour, 3(3):197–197, 2019. doi:10.1038/s41562-019-0569-7.URL https://doi.org/10.1038/s41562-019-0569-7. ! page 137T.-H. Ng. Issues of simultaneous tests for noninferiority and superiority. Journalof Biopharmaceutical Statistics, 13(4):629–639, 2003. ! page 14T.-H. Ng. Noninferiority hypotheses and choice of noninferiority margin.Statistics in Medicine, 27(26):5392–5406, 2008. ! page 13C. L. Nord, V. Valton, J. Wood, and J. P. Roiser. Power-up: A reanalysis of ‘powerfailure’ in neuroscience using mixture modeling. Journal of Neuroscience, 37(34):8051–8061, 2017. ! page 39B. A. Nosek, J. R. Spies, and M. Motyl. Scientific utopia II. restructuringincentives and practices to promote truth over publishability. 
Perspectives onPsychological Science, 7(6):615–631, 2012. ! page 27B. A. Nosek, C. R. Ebersole, A. DeHaven, and D. Mellor. The preregistrationrevolution. 2017. URL osf.io/2dxu5. ! page 2J. Ocan˜a i Rebull, M. P. Sa´nchez Olavarrı´a, A. Sa´nchez, and J. L. Carrasco Jordan.On equivalence and bioequivalence testing. Sort, 32(2):151–176, 2008. !page 6A. O’Hagan, J. W. Stevens, and M. J. Campbell. Assurance in clinical trialdesign. Pharmaceutical Statistics, 4(3):187–201, 2005. ! page 16B. O’Hara. Negative results are published. Nature, 471(7339):448–449, 2011. !page 26OpenScienceCollaboration. Estimating the reproducibility of psychologicalscience. Science, 349(6251):aac4716, 2015. ! pages 1, 49158D. F. Parkhurst. Statistical significance tests: Equivalence and reverse tests shouldreduce misinterpretation equivalence tests improve the logic of significancetesting when demonstrating similarity is important, and reverse tests can helpshow that failure to reject a null hypothesis does not support that hypothesis.Bioscience, 51(12):1051–1057, 2001. ! page 11M. Pautasso. Worsening file-drawer problem in the abstracts of natural, medicaland social science databases. Scientometrics, 85(1):193–202, 2010. ! page137M.-E. Pe´rez and L. R. Pericchi. Changing statistical significance with the amountof information: The adaptive a significance level. Statistics & ProbabilityLetters, 85:20–24, 2014. ! page 139M. D. Perlman, L. Wu, et al. The emperor’s new tests. Statistical Science, 14(4):355–369, 1999. ! page 13G. Piaggio, D. R. Elbourne, D. G. Altman, S. J. Pocock, S. J. Evans, C. Group,et al. Reporting of noninferiority and equivalence randomized trials: anextension of the consort statement. The Journal of the American MedicalAssociation, 295(10):1152–1160, 2006. ! page 13R. J. Ploutz-Snyder, J. Fiedler, and A. H. Feiveson. Justifying small-n research inscientifically amazing settings: challenging the notion that only ?big-n? studiesare worthwhile. 
Journal of Applied Physiology, 116(9):1251–1252, 2014. !page 140S. J. Pocock and G. W. Stone. The primary outcome fails -what next? NewEngland Journal of Medicine, 375(9):861–870, 2016. ! page 11K. Popper. The logic of scientific discovery. London, UK: Hutchinson, 1959. !page 41L. Pusztai, C. Hatzis, and F. Andre. Reproducibility of research and preclinicalvalidation: problems and solutions. Nature Reviews Clinical Oncology, 10(12):720, 2013. ! page 85S. Ramsey and J. Scoggins. Commentary: practicing on the tip of an informationiceberg? Evidence of underpublication of registered clinical trials in oncology.The Oncologist, 13(9):925–929, 2008. ! page 23C. S. Reichardt and H. F. Gollob. When confidence intervals should be usedinstead of statistical tests, and vice versa. In L. L. Harlow, S. A. Mulaik, and159J. H. Steiger, editors, What If There Were No Significance Tests?: ClassicEdition, chapter 10, pages 231–254. Routledge, 2016. ! page 10S. Reysen. Publication of nonsignificant results: a survey of psychologists’opinions. Psychological Reports, 98(1):169–175, 2006. ! page 2F. D. Richard, C. F. Bond Jr, and J. J. Stokes-Zoota. One hundred years of socialpsychology quantitatively described. Review of General Psychology, 7(4):331,2003. ! pages 44, 45J. M. Roos. Measuring the effects of experimental costs on sample sizes.http://www.jasonmtroos.com/assets/media/papers/Experimental Costs Sample Sizes.pdf, 2017. Unpublished, accessed on2019-01-01. ! pages 41, 49R. Rosenthal. The file drawer problem and tolerance for null results.Psychological Bulletin, 86(3):638–641, 1979. ! page 137J. S. Ross, G. K. Mulvey, E. M. Hines, S. E. Nissen, and H. M. Krumholz. Trialpublication after registration in clinicaltrials.gov: a cross-sectional analysis.PLoS Medicine, 6(9):e1000144, 2009. ! page 23K. J. Rothman and S. Greenland. Planning study size based on precision ratherthan power. Epidemiology, 29(5):599–603, 2018. ! page 17J. N. Rouder. Optional stopping: No problem for Bayesians. 
Psychonomic Bulletin & Review, 21(2):301, 2014. → page 20
J. N. Rouder, P. L. Speckman, D. Sun, R. D. Morey, and G. Iverson. Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2):225–237, 2009. → pages 18, 19, 31, 98
D. L. Sackett and D. J. Cook. Can we learn anything from small trials? Annals of the New York Academy of Sciences, 703(1):25–32, 1993. → page 3
J. K. Sakaluk. Exploring small, confirming big: An alternative system to the new statistics for advancing cumulative and replicable psychological research. Journal of Experimental Social Psychology, 66:47–54, 2016. → pages 48, 66
F. D. Schönbrodt and E.-J. Wagenmakers. Bayes factor design analysis: Planning for compelling evidence. Psychonomic Bulletin & Review, pages 1–15, 2016. → page 20
K. F. Schulz and D. A. Grimes. Sample size calculations in randomised trials: mandatory and mystical. The Lancet, 365(9467):1348–1353, 2005. → page 3
S. Senn. Misunderstanding publication bias: editors are not blameless after all. F1000Research, 1, 2012. → pages 2, 51
Y. Shao, V. Mukhi, and J. D. Goldberg. A hybrid Bayesian-frequentist approach to evaluate clinical trial designs for tests of superiority and non-inferiority. Statistics in Medicine, 27(4):504–519, 2008. → page 17
G. Shieh. Exact power and sample size calculations for the two one-sided tests of equivalence. PLoS ONE, 11(9):e0162093, 2016. → page 14
P. G. Shields. Publication bias is a scientific problem with adverse ethical outcomes: The case for a section for null results. Cancer Epidemiology and Prevention Biomarkers, 9(8):771–772, 2000. ISSN 1055-9965. URL http://cebp.aacrjournals.org/content/9/8/771. → page 22
P. G. Shields, T. A. Sellers, and T. R. Rebbeck. Null results in brief: meeting a need in changing times. Cancer Epidemiology and Prevention Biomarkers, 18(9):2347–2347, 2009. → page 138
K. Siler, K. Lee, and L. Bero. Measuring the effectiveness of scientific gatekeeping. Proceedings of the National Academy of Sciences, 112(2):360–365, 2015. → page 50
J. P. Simmons, L. D. Nelson, and U. Simonsohn. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366, 2011. → pages 4, 22
P. E. Smaldino and R. McElreath. The natural selection of bad science. Royal Society Open Science, 3(9):160384, 2016. → pages 27, 41
Y. M. Smulders. A two-step manuscript submission process can reduce publication bias. Journal of Clinical Epidemiology, 66(9):946–947, 2013. → page 22
F. Song, Y. Loke, and L. Hooper. Why are medical and health-related studies not being published? A systematic review of reasons given by investigators. PLoS ONE, 9(10):e110418, 2014. → pages 22, 50
S. Y. Song, D.-H. Koo, S.-Y. Jung, W. Kang, and E. Y. Kim. The significance of the trial outcome was associated with publication rate and time to publication. Journal of Clinical Epidemiology, 84:78–84, 2017. → page 23
V. Spezi, S. Wakeling, S. Pinfield, J. Fry, C. Creaser, and P. Willett. “Let the community decide” the vision and reality of soundness-only peer review in open-access mega-journals. Journal of Documentation, 74(1):137–161, 2018. → page 121
D. Spiegelhalter. Trust in numbers. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(4):948–965, 2017. → page 2
T. Stanley, S. B. Jarrell, and H. Doucouliagos. Could it be better to discard 90% of the data? A statistical paradox. The American Statistician, 64(1):70–77, 2010. → page 61
T. D. Sterling, W. L. Rosenbaum, and J. J. Weinkam. Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49(1):108–112, 1995. → page 2
P. Suñé, J. M. Suñé, and J. B. Montoro. Positive outcomes influence the rate and time to publication, but not the impact factor of publications of clinical trial results. PLoS ONE, 8(1):e54583, 2013. → page 2
D. Szucs and J. Ioannidis. When null hypothesis significance testing is unsuitable for research: a reassessment. Frontiers in Human Neuroscience, 11:390, 2017a. → page 1
D. Szucs and J. P. Ioannidis. Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biology, 15(3):e2000797, 2017b. → pages 42, 90
M. Toerien, S. T. Brookes, C. Metcalfe, I. De Salis, Z. Tomlin, T. J. Peters, J. Sterne, and J. L. Donovan. A review of reporting of participant recruitment and retention in RCTs in six major journals. Trials, 10(1):52, 2009. → page 24
D. Trafimow and M. Marks. Editorial. Basic and Applied Social Psychology, 37(1):1–2, 2015. doi:10.1080/01973533.2015.1012991. → page 2
H.-H. Tsou, C.-F. Hsiao, S.-C. Chow, L. Yue, Y. Xu, and S. Lee. Mixed noninferiority margin and statistical tests in active controlled trials. Journal of Biopharmaceutical Statistics, 17(2):339–357, 2007. → page 13
M. A. van Assen, R. C. van Aert, M. B. Nuijten, and J. M. Wicherts. Why publishing everything is more effective than selective publishing of statistically significant results. PLoS ONE, 9(1):e84896, 2014. → page 137
D. van Dijk, O. Manor, and L. B. Carey. Publication metrics and success on the academic job market. Current Biology, 24(11):R516–R517, 2014. → page 46
M. van Lent, J. IntHout, and H. J. Out. Differences between information in registries and articles did not influence publication acceptance. Journal of Clinical Epidemiology, 68(9):1059–1067, 2015. → page 23
S. Vasishth and A. Gelman. The illusion of power: How the statistical significance filter leads to overconfident expectations of replicability. arXiv preprint arXiv:1702.00556, 2017. → page 4
E. Wagenmakers, R. Wetzels, D. Borsboom, et al. Why psychologists must change the way they analyze their data: the case of psi: comment on Bem (2011). Journal of Personality and Social Psychology, 100(3):426–432, 2011. → page 20
E.-J. Wagenmakers. A practical solution to the pervasive problems of p-values. Psychonomic Bulletin & Review, 14(5):779–804, 2007. → page 4
E.-J. Wagenmakers, M. Marsman, T. Jamil, A. Ly, J. Verhagen, J. Love, R. Selker, Q. F. Gronau, M. Šmíra, S. Epskamp, et al. Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. Psychonomic Bulletin & Review, 25(1):35–57, 2018. → page 101
E. Wager and P. Williams. “Hardly worth the effort” - Medical journals’ policies and their editors’ and publishers’ views on trial registration and publication bias: quantitative and qualitative study. The BMJ, 347:f5248, 2013. → page 23
A. Walker. Low power and striking results–a surprise but not a paradox. The New England Journal of Medicine, 332(16):1091, 1995. → page 39
E. Walker and A. S. Nowacki. Understanding equivalence and noninferiority testing. Journal of General Internal Medicine, 26(2):192–196, 2011. → pages 8, 13
G. W. Walster and T. A. Cleary. A proposal for a new editorial policy in the social sciences. The American Statistician, 24(2):16–19, 1970. → pages 2, 22, 87
M. Warren. First analysis of ‘pre-registered’ studies shows sharp rise in null findings. nature.com website, 2018. URL https://www.nature.com/articles/d41586-018-07118-1. → page 94
R. L. Wasserstein and N. A. Lazar. The ASA’s statement on p-values: context, process, and purpose. The American Statistician, 70(2):129–133, 2016. → page 2
Y. Wei and F. Chen. Lowering the p-value threshold - reply. The Journal of the American Medical Association, 320(9):937–938, 2018. → page 38
S. Wellek. Testing Statistical Hypotheses of Equivalence and Noninferiority. CRC Press, 2010. → page 9
S. Wellek. A critical evaluation of the current “p-value controversy”. Biometrical Journal, 59(5):854–872, 2017. → pages 13, 139
B. L. Wiens. Choosing an equivalence limit for noninferiority or equivalence studies. Controlled Clinical Trials, 23(1):2–14, 2002. → page 12
B. L. Wiens and B. Iglewicz. Design and analysis of three treatment equivalence trials. Controlled Clinical Trials, 21(2):127–137, 2000. → page 9
B. M. Wilson and J. T. Wixted. The prior odds of testing a true effect in cognitive and social psychology. Advances in Methods and Practices in Psychological Science, 1(2):186–197, 2018. → page 49
R. Wiseman, C. Watt, and D. Kornbrot. Registered reports: an early example and analysis. PeerJ, 7(e6232), 2019. doi:10.7717/peerj.6232. → page 87
A. W. Yeung. Do neuroscience journals accept replications? A survey of literature. Frontiers in Human Neuroscience, 11:468, 2017. → page 42
D. A. Zarin, T. Tse, R. J. Williams, and S. Carr. Trial reporting in ClinicalTrials.gov - the final rule. New England Journal of Medicine, 375(20):1998–2004, 2016. → page 23
X. Zhang, G. Cutter, and T. Belin. Bayesian sample size determination under hypothesis tests. Contemporary Clinical Trials, 32(3):393–398, 2011. → pages 4, 99
G. Zhao. Considering both statistical and clinical significance. International Journal of Statistics and Probability, 5(5):16, 2016. → page 7
H. Zhu. Sample size calculation for comparing two Poisson or negative binomial rates in noninferiority or equivalence trials. Statistics in Biopharmaceutical Research, 9(1):107–115, 2017. → page 14
B. D. Zumbo and A. M. Hubley. A note on misconceptions concerning prospective and retrospective power. Journal of the Royal Statistical Society: Series D (The Statistician), 47(2):385–388, 1998. → page 2
Item Metadata
Title | If journals embraced conditional equivalence testing, would research be better? |
Creator | Campbell, Harlan |
Publisher | University of British Columbia |
Date Issued | 2019 |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2019-07-03 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0379722 |
URI | http://hdl.handle.net/2429/70873 |
Degree | Doctor of Philosophy - PhD |
Program | Statistics |
Affiliation | Science, Faculty of; Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2019-09 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |