UBC Theses and Dissertations

Simpson's paradox in epistemology and decision theory Memetea, Sonia 2015

Simpson’s Paradox in Epistemology and Decision Theory

by

Sonia Memetea

BPhil, University of Oxford, 1999
BA, University of Oxford, 1997

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Faculty of Graduate and Postdoctoral Studies (Philosophy)

The University of British Columbia (Vancouver)

May 2015

© Sonia Memetea, 2015

Abstract

I discuss the implications of Simpson’s paradox for epistemology and decision theory. In Chapter One I outline the paradox, focussing on its identification, nature, and the type of reasoning that it involves. In Chapter Two I discuss the view that Simpson’s paradox is resolved by means of graph-based causal analysis. In Chapter Three I outline a major problem (hitherto unacknowledged) that Simpson’s paradox poses for the probabilistic Bayesian theory of evidence. I make a proposal to split the probabilistic concept of evidence into a causal and a news-value kind of evidence, the former tracking causal probabilities under an intervention, the latter tracking overall probabilistic relevance under conditioning. In Chapters Four and Five I apply the proposal to two further areas of concern. In Chapter Four I defend a unified causal view of Simpson’s paradox. In Chapter Five I defuse a problem about the counterfactual force of evidence. In Chapters Six and Seven I discuss Simpson’s paradox and the sure thing principle in decision theory. I then conclude in Chapter Eight.

Preface

This dissertation is original, unpublished, independent work by the author, Sonia Memetea.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Beginnings: E.H. Simpson’s paradox
  1.2 Areas of controversy
    1.2.1 Identification of Simpson’s paradox
    1.2.2 Nature of Simpson’s paradox
    1.2.3 Reasoning in Simpson’s paradox
  1.3 Open problems
    1.3.1 Resolution of Simpson’s paradox
    1.3.2 Evidence
    1.3.3 Decision theory
  1.4 Thesis outline
2 Puzzle or Paradox: Has Simpson’s Paradox Been Solved?
  2.1 Introduction
  2.2 How to proceed in Simpson’s paradox
  2.3 Backdoor criterion and Simpson’s paradox
    2.3.1 Decision theory and the problem of the Bad Partition
    2.3.2 Objection from the nature of arguments based on causal isomorphisms
  2.4 Conclusion
3 Two Notions of Evidence
  3.1 Introduction
  3.2 Causation and evidence
    3.2.1 Causation and Simpson’s paradox
    3.2.2 Evidence and Simpson’s paradox
  3.3 Causal and news value degree of belief
    3.3.1 Comparison with Menzies and Price [55]
    3.3.2 Prediction and explanation in Simpson’s paradox
  3.4 Conclusion
4 Simpson’s Paradox Unified
  4.1 Introduction
  4.2 Simpson’s paradox as a paradox about ratios
    4.2.1 The ratio argument
    4.2.2 Future knowledge as knowledge by intervention
    4.2.3 Causal reasoning in disguise
  4.3 Conclusion
5 Simpson’s Paradox and Counterfactual Projection
  5.1 Introduction
  5.2 Morton’s account
    5.2.1 Causal structure of Morton’s example
    5.2.2 A single evidential relation, or two?
    5.2.3 Backtracking and non-backtracking counterfactual conditionals in relation to evidence
  5.3 Newcomb’s paradox
  5.4 Conclusion
6 Simpson’s Paradox and the Sure Thing Principle
  6.1 Introduction
    6.1.1 Background to STP
  6.2 SP challenges to STP
    6.2.1 First challenge: validity
    6.2.2 Second challenge: partition sensitivity
  6.3 Responses
    6.3.1 Response to the first challenge: validity
    6.3.2 Response to the second challenge: partition sensitivity
  6.4 Conclusion
7 Unrestricted Restricted STP
  7.1 Introduction
    7.1.1 Introducing the counterexample
  7.2 Modelling the knowledge restriction
  7.3 Conclusion
8 Conclusions
Bibliography
A Conditioning and Intervening
  A.1 Decomposing a joint probability function
    A.1.1 Formal representations
  A.2 Connection with other frameworks

List of Tables

Table 2.1 Medical case.
Table 2.2 Botanical case.
Table 4.1 Ratio to probabilities.
Table 4.2 Bennett distribution.
Table 4.3 Aggregation of data.
Table 5.1 Rejection data, Morton 2002.
Table 6.1 Treatment data, Chicago.
Table 6.2 Utilities Reoboam.
Table 6.3 Reoboam data.
Table 6.4 Reoboam redux.
Table 6.5 Product data for charisma and Machiavellianism.

List of Figures

Figure 2.1 A collapsible association incompatible with Simpson’s paradox.
Figure 2.2 A second collapsible association incompatible with Simpson’s paradox.
Figure 2.3 A third collapsible association incompatible with Simpson’s paradox.
Figure 2.4 Common cause.
Figure 2.5 Indirect effect.
Figure 2.6 Collider.
Figure 2.7 Common parent.
Figure 3.1 Guilty pleasures.
Figure 3.2 An exodus of smokers.
Figure 3.3 The irrelevance of smoking.

Acknowledgments

I am grateful for the support and encouragement of my supervising committee: Paul Bartha, Adam Morton and Christopher Stephens. I am also especially thankful to the Graduate Secretary, Rhonda Janzen.

Chapter 1

Introduction

The aim of this thesis is to argue for the importance of Simpson’s paradox in epistemology, and for important implications in decision theory that go beyond those that have so far been acknowledged. Although philosophical debates about the nature of causation and causal reasoning have been thoroughly permeated by an awareness of the consequential role played by Simpson’s paradox in delineating the limits of causal inference and reasoning about causation in general, very little attention has been paid to its implications for other areas in philosophy with fundamental ties to probability. The popular Bayesian conception of probabilistic evidence is one.

The omission of the concept of evidence from the debate around Simpson’s paradox seems unaccountable. Whilst repercussions of the Simpson’s-paradoxical reversals of probabilistic relevance for probabilistic causality have been extensively investigated, a corresponding debate focussed on the probabilistic theory of evidence has apparently not been initiated yet. It is a major aim of this thesis to close this gap.
I will argue that, to accommodate the force of evidence in cases in which the Bayesian subject is represented as possessing causal knowledge, the classic Bayesian concept of evidence may need to split into two distinct notions of evidence, which by analogy with decision theory I will call causal and news value evidence. Moreover, I will argue that formal epistemology needs to recognize the distinction between two different kinds of degree of belief: degree of belief under an intervention and degree of belief by conditioning.

As I maintain in this thesis, the nature of Simpson’s paradox is inherently causal. The paradox does not arise merely from the probabilistic consistency of models in which associations between variables are reversed relative to different ways of partitioning the logical space. Rather, the paradox is causal in nature. As Judea Pearl has persuasively argued, the force of the paradox can only be grasped if we understand it in terms of its implications for causal reasoning. The graph-theoretical framework developed by Pearl is by far the most formidable and successful representation of causation and causal reasoning to date. It does much to tame and shed light on Simpson’s paradox. Yet I will argue that even in Pearl’s framework the paradox is not resolved, as Pearl claims, by the application of purely graph-based techniques such as the backdoor criterion. There is an ineliminable contextual factor in our judgements about causation which the techniques developed by Pearl and others appear to sideline.

Moreover, I will argue that difficulties involving causal knowledge in Simpson’s paradox appear within decision theory itself. To date, the most influential and well-developed theory of rational decision making remains that of L. Savage. Savage’s axiomatization of the notion of rational preference, which enabled his central result (the representation theorem) to be derived, involves the famous sure thing principle.
I will argue that the sure thing principle is susceptible to iterated forms of Simpson’s paradox, in which the reversed association relative to one partition is itself reversed relative to a further partition, and so forth. The immediate implication for Savage’s theory is that it makes causal knowledge a precondition for the application of the framework. To know how to apply the framework, we must have causal knowledge of a far more extensive nature than previously supposed.

This chapter is organized as follows. Following a brief philosophical survey of E.H. Simpson’s seminal article, I will outline the paradox in terms of three initial areas of controversy: its identification, its nature, and the type of reasoning that it involves. Once the initial characterization is completed, I will then unfold three open problems which my thesis will tackle: the practical problem of resolving Simpson’s paradox (selecting a course of action); the epistemological problem of formulating a viable concept of evidence; and two decision-theoretic problems (incorporating causal knowledge into decisions, and applying Savage’s sure thing principle). I conclude with a brief outline of the dissertation.

1.1 Beginnings: E.H. Simpson’s paradox

Simpson’s paradox refers to a probabilistic phenomenon in which an association between a pair of variables X, Y can be reversed or cancelled by conditioning on a third variable Z. More precisely, it includes cases in which an association between X and Y switches direction by conditioning on Z, as well as cases in which conditioning on Z reveals or destroys an association obtaining between X and Y.
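Before proceeding, a minimal numerical sketch of the reversal case may help fix ideas. The counts below are invented for illustration (they are not Simpson’s own): within each cell of the partition the treated recover more often, yet the aggregate association points the other way.

```python
# Hypothetical treatment/recovery counts, partitioned by a third variable Z.
# Format: (z, treated): (number recovered, group size)
counts = {
    ("women", True):  (20, 40),
    ("women", False): (4, 10),
    ("men", True):    (9, 10),
    ("men", False):   (32, 40),
}

def rate(cells):
    """Recovery rate pooled over the given (recovered, total) cells."""
    rec = sum(r for r, n in cells)
    tot = sum(n for r, n in cells)
    return rec / tot

# Within each Z-cell, treatment is associated with higher recovery.
for z in ("women", "men"):
    assert rate([counts[(z, True)]]) > rate([counts[(z, False)]])

# Aggregated over Z, the inequality flips.
treated = rate([counts[(z, True)] for z in ("women", "men")])     # 29/50 = 0.58
untreated = rate([counts[(z, False)] for z in ("women", "men")])  # 36/50 = 0.72
assert treated < untreated
```

The flip is driven by the unequal allocation: most treated individuals are women (the low-recovery group), so pooling dilutes the treated group’s rate.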
In the framework of random variables which has become popular since the publication of Simpson’s landmark article, this implies that partitioning by the variable Z can induce probabilistic relations between X and Y which are either not visible when the data is aggregated, or else have a different polarity (and thus a different statistical significance) compared to the data disaggregated by conditioning on Z.

The phenomenon to which E.H. Simpson in due course gave his name had been identified in the statistical literature a number of years prior to the publication of Simpson [73]. Close historical antecedents include Yule [89] and Pearson [65]. The reversal phenomenon is first noted explicitly in Cohen and Nagel [11]. Owing to its complex ancestry, Simpson’s paradox has variously been referred to as Yule’s paradox, the amalgamation paradox, and the reversal paradox. The philosophical community has embraced “Simpson’s paradox”, perhaps in deference to Simpson’s role in promulgating the first philosophically stimulating formulation of the phenomenon.

The importance of Simpson’s contribution does not rest on his having flagged the existence of the phenomenon. It rests on his being the first to produce an argument in which judgments of causal and probabilistic equivalence between models are clearly distinct and consequential aspects of interpretation; and on his ability to derive the correct implications insofar as decisions based on Simpson’s-paradoxical data are concerned. Simpson exploited the possibility that two distinct causal processes can generate the same patterns of probabilistic (in)dependence in the data, by fitting probabilistically isomorphic models (i.e. models described by the same probability function) with causally inequivalent interpretations[1].
Moreover, he correctly noted that sometimes “the sensible answer”, as he put it, to a decision problem accords with the disaggregated data, and sometimes with the aggregate.

In his original exposition of the problem Simpson presented a pair of probabilistically identical scenarios which nevertheless exemplified different background causal structures:

“Baby”. Baby plays with an ordinary pack of cards, inadvertently producing an association between colour (red, black) and class of card (court, plain) relative to a partition of the cards into those which are dirty and those which are clean. In a standard pack of cards, however, there is no association between colour and class of card. The intuitive explanation is that Baby has a strong preference for one attribute, e.g. for court cards, and a preference for a specific colour, e.g. for red, and the cards are dirty or clean depending on how heavily they are foxed by Baby. In such a case, the partition association isn’t “real”, Simpson remarks.

“Treatment”. A treatment is administered to men and women, and the results of the study appear to suggest that the treatment is beneficial to each subpopulation but has no efficacy in the aggregate population of individuals. The intuitive analysis is that men are both given the treatment preferentially and tend to recover more often regardless of treatment, resulting in the masking of the true effect of the drug in the aggregate statistic. Simpson concludes that the association revealed in partitioning by gender is “real”.

[1] Although it must be noted that Simpson did not use the word “cause” himself, it is clear from his discussion that he is alive to the distinction between mere data and causal structure. I take Simpson to have been a causal theorist avant la lettre.
The very strategy of appealing to causal isomorphism to make a point about causal structure (nowadays an extremely popular move in debates about causation) is one that Simpson himself seems to have initiated.

Clearly the difference between the two stories lies in the causal grain. In the subsequently developed language of causal analysis, we may describe Simpson’s two examples as follows.

For “Baby”: Probabilistically, this is a case in which X and Y, representing colour and class, are independent, but become dependent conditional on Z, representing soiling by Baby. Causally, the structure is one in which X and Y are independent causes of a common effect Z. Graph theory refers to this configuration X → Z ← Y as forming a collider at Z. A collider node does not pass information between X and Y unless the collider or a descendant of the collider is conditioned on.

[Figure: the collider graph X → Z ← Y]

For “Treatment”: The scenario is probabilistically identical to “Baby”, with X and Y interpreted as referring to treatment and recovery and Z as gender. Causally, however, the structure is one with confounding, in which X and Y are the individual effects of a common cause Z; moreover, there is a further direct link from X to Y representing the fact that treatment has a causal influence on recovery.

[Figure: the confounded graph X ← Z → Y together with the direct edge X → Y]

Note that graph theory tells us that collider configurations normally exhibit the pattern in which X and Y are independent yet become dependent conditional on Z. In statistics the phenomenon relates to Berkson’s paradox, alternatively known as the selection bias paradox[2]. But the interesting thing about the second graph is that one would not normally expect X and Y to come out probabilistically independent.
It is a special feature of the example constructed by Simpson that the bias in the treatment assignment variable together with the gender effect happen contingently to cancel out the direct effect of the treatment on recovery, resulting in the overall independence of recovery and treatment[3].

With the benefit of hindsight, we can state that Simpson’s remarkably penetrating pair of illustrations drives home the following important lesson. The phenomenon of Simpson’s paradox is not the same as the phenomenon of confounding, although it may be possible to generate it in a confounded causal graph. Solving Simpson’s paradox, moreover, is not the same as solving the problem of confounding (how to resolve Simpson’s paradox is a topic to which I turn explicitly in chapter two). Simpson’s paradox may arise in a non-confounded graph, such as “Baby”. In order to resolve Simpson’s paradox we may have to get rid of confounding influences. But from the start we ought to bear in mind that Simpson’s paradox cannot be identified with the distinct phenomenon of confounding[4].

[2] An extreme example of Berkson’s paradox: if you know that an alarm was set off either by an earthquake or by a break-in, then finding out there was an earthquake renders a break-in less probable, and vice versa. A gentler case: you see that the security monitoring light is flashing, which may well have been caused by the alarm, inducing dependence among its causes.

[3] If we label the edges of the graph with regression coefficients α, β, γ, as in regression analysis in a functional equation context, the choice of parameters would have to respect the equation −αβ = γ.

[4] Some of these remarks will become clearer in later chapters. An excellent source dealing with the relationship between confounding and the related statistical notion of collapsibility of a measure of association is Hernán et al. [27].

1.2 Areas of controversy

To get a better feel for Simpson’s paradox as it evolved out of Simpson’s seminal discussion, it is worth examining three crucial areas of controversy. They are: the identification of the probabilistic conditions that define Simpson’s paradox, the nature of Simpson’s paradox, and the reasoning that makes Simpson’s paradox paradoxical. The three highlighted areas do not constitute an exhaustive list of problems, but they do contribute a helpful context in which to synthesize and situate my own discussion in the thesis.

1.2.1 Identification of Simpson’s paradox

An important area of controversy in the literature on Simpson’s paradox is the identification of the paradox as a probabilistic phenomenon. The influential account by Blyth [8] defines cases to be Simpson’s-paradoxical when an overall association is reversed or cancelled relative to a partition. But Blyth’s characterization did not include the case in which a partition-relative association is cancelled overall, as can be seen directly from his famous boundary conditions, which permit the following characterization of Simpson’s paradox:

  Mathematically, Simpson’s paradox is this: It is possible to have

    P(A ∣ B) < P(A ∣ B′)

  and have at the same time both

    P(A ∣ BC) ≥ P(A ∣ B′C)
    P(A ∣ BC′) ≥ P(A ∣ B′C′)

  (8, page 364)

Blyth’s boundary conditions helped delineate “the most extreme” form of Simpson’s paradox. He showed that it is possible to obtain a Simpson’s-paradoxical probability function in which the probability of A conditional on B is virtually zero, whilst conditioned on the negation of B it is virtually one. More generally, he demonstrated that, subject to the conditions

    P(A ∣ BC) ≥ δ P(A ∣ B′C)
    P(A ∣ BC′) ≥ δ P(A ∣ B′C′)

with δ ≥ 1, it is possible to have

    P(A ∣ B) ≈ 0 and P(A ∣ B′) ≈ 1/δ

The case where the conditional probability of A relative to the partition by B spans the entire interval can be obtained by setting δ to be 1.
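A distribution of this extreme kind is easy to exhibit with concrete numbers. The counts below are my own construction in the spirit of Blyth’s result, not Blyth’s own figures: within each cell of the partition C the B-group does at least as well as the B′-group, yet overall P(A ∣ B) is nearly 0 and P(A ∣ B′) nearly 1.

```python
# Illustrative counts: (group, cell of C): (group size in cell, of whom A)
cells = {
    ("B", "C"):   (1000, 10),   # P(A | B, C)   = 0.01
    ("B'", "C"):  (1,    0),    # P(A | B', C)  = 0.00
    ("B", "C'"):  (1,    1),    # P(A | B, C')  = 1.00
    ("B'", "C'"): (1000, 990),  # P(A | B', C') = 0.99
}

def p_a(group):
    """Overall P(A | group), aggregating over the partition C."""
    n = sum(t for (g, _), (t, a) in cells.items() if g == group)
    a = sum(a for (g, _), (t, a) in cells.items() if g == group)
    return a / n

# Within each C-cell, B does at least as well as B' ...
assert 10 / 1000 >= 0 / 1 and 1 / 1 >= 990 / 1000
# ... yet aggregated, B looks almost hopeless and B' almost certain.
print(round(p_a("B"), 3), round(p_a("B'"), 3))  # 0.011 0.989
```

The trick is the allocation: almost everyone in B sits in the low-A cell and almost everyone in B′ in the high-A cell, so the aggregate ratios are dominated by cell membership rather than by the within-cell comparison.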
Essentially, Blyth’s boundary conditions showed that it is possible to invert (or at least virtually cancel out) any probabilistic association between A and B, even the most extremely polarized, by means of conditioning on a partition C.

Blyth’s boundary conditions and mathematical characterization of Simpson’s paradox seem to go far towards identifying the paradox in a precise mathematical form. However, Bandyopadhyay et al. [5] object that Blyth’s account is unnecessarily restrictive. As noted above, a case in which a partition-relative association is cancelled out overall is not covered by Blyth’s boundary conditions, yet intuitively it ought to count as Simpson’s-paradoxical. Indeed, there is no reason to count the cessation of an overall correlation relative to a partition, but disqualify the cessation of an association in the other direction. The objection seems legitimate, although, as the alternative account in Bandyopadhyay et al. [5] shows, modifying Blyth’s conditions to include the cancellation case left out by his analysis is not trivial. Clearly, one cannot just change Blyth’s first strict inequality above to a ≤ inequality, because that permits the non-paradoxical case in which A and B are independent both overall and relative to C.

In a similar vein, the mathematician Carl Wagner expresses the paradox as follows:

  In general, Simpson’s paradox is said to occur whenever (A, Ā), (B, B̄), and (C1, ..., Cn) are partitions of a sample X and P(A ∣ B) γ P(A ∣ B̄), while at the same time P(A ∣ B ∩ Ci) λ P(A ∣ B̄ ∩ Ci) for i = 1, ..., n, where γ and λ are incompatible relations belonging to the set {≤, <, =, >, ≥}.
  (82, footnote 3, page 3) (italics mine)

As the emphasis on incompatible relations of (in)equality shows, Wagner’s definition encompasses the outstanding case of an association obtaining within the partition which disappears once the data is aggregated overall. One can take an inspired guess that most philosophers discussing Simpson’s paradox would agree with Wagner [82] and Bandyopadhyay et al. [5] about the inclusion of the cessation case under the aegis of paradox. Nancy Cartwright, for example, characterizes Simpson’s paradox as follows. According to her analysis, Simpson’s paradox is:

  the fact about probabilities [...]: any association - Pr(A ∣ B) = Pr(A); Pr(A ∣ B) > Pr(A); Pr(A ∣ B) < Pr(A) - between two variables which holds in a given population can be reversed in the subpopulation by finding a third variable which is correlated with both. (10, page 422)

Cartwright explicitly identifies the potential for “reversing” overall probabilistic independence, Pr(A ∣ B) = Pr(A), as a form of Simpson’s paradox. Indeed, very few people have demurred on this score. For the reason I gave above, in this thesis I will follow what seems to be the prevalent liberalized view of the probabilistic form of Simpson’s paradox. Indeed, as noted above, Simpson’s own original scenarios involved overall independence between traits that were associated at the subpopulation level.

However, agreeing on the probabilistic conditions necessary to generate Simpson’s paradox does not entail agreeing on whether the probabilistic description is sufficient to account for the nature of the paradox. It is to this second area of controversy that we turn next.

1.2.2 Nature of Simpson’s paradox

In general, the identification of a phenomenon as a paradox has a normative dimension. To refer to a collection of probability statements as constituting a paradox implies that one is conscious of a threat to intuition which, by examining the phenomenon more closely, one hopes ultimately to transform into insight.
But an intuition about what? In the realm of reasoning about probabilities, intuitions are hardly self-guaranteeing. Yet if one takes the view that an intuition is a commitment to a belief that one knows not how to do without[5], then we must at once begin the search for the core of indispensable intuitions which (as it turns out) Simpson’s paradox does nothing to impugn.

According to an increasingly popular analysis championed by Judea Pearl, the indispensable intuitions do not involve probability but causality. Simpson’s paradox is counterintuitive in a deep, non-trivial sense only when it appears to violate standards for causal consistency; for example, when it appears to suggest that a drug treatment can be beneficial to men and women considered separately yet detrimental in the population as a whole. The misleading suggestion originates in an unwary, uniformly causal construal of probabilistic statements involving the Simpson’s-paradoxical inequalities. For example, consider:

    Pr(recovery ∣ treatment) < Pr(recovery ∣ ¬treatment)    (1.1)
    Pr(recovery ∣ treatment, woman) > Pr(recovery ∣ ¬treatment, woman)    (1.2)
    Pr(recovery ∣ treatment, man) > Pr(recovery ∣ ¬treatment, man)    (1.3)

The mistake is to interpret these as stating that the causal probability of recovery is positively influenced by treatment among the population segregated by gender yet negatively influenced overall. If we assume that administering the treatment cannot causally influence gender, despite the fact that gender and treatment must be probabilistically dependent in the data to generate the inequalities, then (1.1) cannot be a statement about causal probability, unlike (1.2) and (1.3). What is genuinely impossible is precisely what a uniform causal interpretation of (1.1), (1.2), and (1.3) would commit us to, namely, that the presumed causal mechanism responsible for the effects of the drug in men and women somehow ceases to operate when the two sub-populations are aggregated.
There is nothing in our understanding of causation that could possibly licence this opaque and puzzling implication.

[5] Woods [87], page 277.

The unattractiveness of giving a uniformly causal reading to (1.1), (1.2), and (1.3) might seem obvious. But the problem can be brought out in a further, subtler way. Suppose we wanted to know whether we ought to administer treatment to a new patient plucked from the population at large, whose gender is unknown. If (1.1) were to express a genuine causal statement concerning the efficacy of treatment for individuals, then we ought not to administer treatment. But the concurrently causal interpretation of (1.2) and (1.3), and the fact that the patient can only be chromosomally male or female (tertium non datur on this occasion), themselves imply that we would administer the treatment if we knew the patient’s gender. But intuitively our decision to administer or withhold the treatment should not depend on our knowledge of an attribute which the patient must possess, namely the attribute of being gendered. Conversely, giving up on the causal interpretation of (1.1) releases us from the conceptual embarrassment of unaccountably conferring an active causal role in the mechanism determining recovery on our knowledge of the patient’s gender. More importantly, giving up on the causal upshot of (1.1) ensures that we make the correct decision regardless of our ignorance of specific gender. Quite clearly, we should treat the patient in question, since (1.2) and (1.3) explicitly tell us that treatment promotes recovery in any individual, male or female.

Note that similar remarks apply to a counterfactual variant of the circumstance envisaged above, for example, if we considered the hypothetical question whether an untreated patient of unknown gender from the sample would have fared marginally better or worse had he or she been treated. Whether the patient is male or female, the patient would have had a better chance of recovery with treatment rather than without.
This reasoning is facilitated by being able to calibrate our causal hypotheses at the level of (1.2) and (1.3). To carry out this process of calibration quite literally requires us to drop the conditional probability of gender given treatment from the counterfactual analysis[6], which means denying that (1.1) plays a corresponding part in counterfactual analysis, just as it cannot consistently be a statement about causal effect.

[6] See the parts in this thesis dealing with the notion of an intervention. As I will explain, we can understand an intervention as a transformation of a probability function from a pre-intervention form to a post-intervention form involving terms literally dropping out from the relevant factorization.

To underscore the limits of interpreting Simpson's paradox in terms of causal probabilities, Pearl demonstrates the following result, which is easily derivable in the logic of causal probabilities:

Sure Thing Principle
An action C that increases the probability of an event E in each subpopulation must also increase the probability of E in the population as a whole, provided that the action does not change the distribution of the subpopulation. (Pearl [62], page 181)

Pearl regards this result as the causal analogue of L. Savage's famous sure thing principle introduced in Savage [72]. It is important to note however that this result does not entail that one can never obtain an instance of Simpson's paradox in which all the probabilities are causal. What Pearl's result shows is that the causal probability reading of the Simpson's paradoxical inequalities is inconsistent with a certain class of causal assumptions, not that it is impossible to obtain it in general. This can be seen by attempting to show that[7]:

Pr(Y = 1 ∣ do(X = 1)) < Pr(Y = 1 ∣ do(X = 0))   (1.4)

is inconsistent with both:

Pr(Y = 1 ∣ do(X = 1), Z = 1) > Pr(Y = 1 ∣ do(X = 0), Z = 1)   (1.5)

Pr(Y = 1 ∣ do(X = 1), Z = 0) > Pr(Y = 1 ∣ do(X = 0), Z = 0)   (1.6)

[7] The notation do(X = x) signifies intervening on X to force it to take the value x. As will be explained later on in this thesis, probabilities under an intervention are essentially causal probabilities.

Expanding the relevant causal probability terms, we obtain:

Pr(Y = 1 ∣ do(X = 1)) = Pr(Y = 1 ∣ do(X = 1), Z = 1) Pr(Z = 1 ∣ do(X = 1)) + Pr(Y = 1 ∣ do(X = 1), Z = 0) Pr(Z = 0 ∣ do(X = 1))

Pr(Y = 1 ∣ do(X = 0)) = Pr(Y = 1 ∣ do(X = 0), Z = 1) Pr(Z = 1 ∣ do(X = 0)) + Pr(Y = 1 ∣ do(X = 0), Z = 0) Pr(Z = 0 ∣ do(X = 0))

Quite clearly, unless we assume that Pr(Z = 1 ∣ do(X = 1)) = Pr(Z = 1) = Pr(Z = 1 ∣ do(X = 0)), we cannot balance out the terms to prove that (1.4) is inconsistent with (1.5) and (1.6). But this means that if we are given a causal model in which the independence condition is the wrong assumption to make, the sure thing result for causal probabilities will not be derivable. In such a situation we have a genuine, causally consistent causal reading of Simpson's paradox.

On the whole, I take it that Pearl's main point is that the mere probabilistic consistency of (1.1), (1.2), and (1.3) and their ilk, ubiquitous as the phenomenon might be[8], is not sufficient to generate an experience of paradox as justifiably afflicted intuition, for which Pearl has coined the eloquent term Simpson's surprise. For Pearl, however, causal v. epistemic is not a distinction superficially grounded in subject matter as much as it is a referencing of different cognitive dimensions.
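The role of the independence assumption in the mixture expansion above can be made concrete with a small numerical sketch (the figures are hypothetical, chosen only to illustrate the algebra):

```python
# Hypothetical stratum-level causal probabilities: in both Z-strata,
# do(X=1) beats do(X=0), i.e. (1.5) and (1.6) hold.
treated   = {"z1": 0.9, "z0": 0.3}   # Pr(Y=1 | do(X=1), Z)
untreated = {"z1": 0.8, "z0": 0.2}   # Pr(Y=1 | do(X=0), Z)

def mix(p_y_z1, p_y_z0, p_z1):
    """Total causal probability Pr(Y=1 | do(X=x)), mixing the Z-strata
    with weight p_z1 = Pr(Z=1 | do(X=x))."""
    return p_y_z1 * p_z1 + p_y_z0 * (1 - p_z1)

# If the intervention leaves Z's distribution fixed, the sure thing
# result goes through: the aggregate comparison matches the strata.
assert mix(treated["z1"], treated["z0"], 0.5) > \
       mix(untreated["z1"], untreated["z0"], 0.5)   # 0.60 > 0.50

# If intervening on X shifts Z's distribution, the aggregate comparison
# can reverse, so (1.4) becomes jointly satisfiable with (1.5) and (1.6).
assert mix(treated["z1"], treated["z0"], 0.1) < \
       mix(untreated["z1"], untreated["z0"], 0.9)   # 0.36 < 0.74
```

With equal mixture weights the stratum-level advantage is preserved in the aggregate; with shifted weights the aggregate comparison reverses, which is exactly the situation in which all three causal inequalities are jointly satisfiable.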
As he observes, causal relations or mechanisms in the real world tend to be far more stable than epistemic relations, which is why information is often much more expediently encoded as a causal pattern rather than as a pattern of probabilities. For example, relations of conditional dependence or independence can easily be created or destroyed by conditioning on further factors in the analysis. But it is inconceivable that causation could exhibit a similar pattern of vagary. The upshot for Pearl is that the Simpson's paradoxical reversal phenomenon is essentially a constraint on causal inference.

[8] For an overview of the pervasiveness of Simpson's paradox in real life, see Wagner [83].

In a sense, Pearl's emphasis on the comparative stability of causal relations compared with the merely epistemic is somewhat unfair to the latter. He is undoubtedly correct that in general a causal relation is more stable than an epistemic relation which is based solely on relations of conditional (in)dependence. But there is a long-standing school of thought in the Bayesian tradition which emphasizes a form of stability in our evidential relations, starting with Keynes's notion of weight of evidence. Keynes [1921] famously argued that "new evidence will sometimes decrease the probability of the hypothesis but will always increase its weight", and moreover, that with more information "we have a more substantive basis on which to base our evidence". More recently Skyrms [74] makes the point that the weight of evidence manifests itself not in terms of conditional probability but in the resilience of an individual probability conditioned on various potential data sequences. Epistemic relations based on more information tend to be more resilient. This may not approach the levels of resilience inherent in a genuine causal relation, but it does ideally reduce the distance.
Note however that Joyce [37] observes that the resilience of epistemic probability is relative to an individual datum; the same belief (credence) may be resilient with respect to one datum and not resilient with respect to another. Susceptibility of the kind that worries Pearl remains a possibility even under highly idealized conditions of having a lot of background information to inform our epistemic evaluations.

Pearl's causal account of the nature of Simpson's paradox has convinced some and alienated others. Wasserman [84] writes that "it is nearly impossible to explain the paradox without using counterfactuals (or directed graphs)". At the other extreme, Bandyopadhyay et al. [5] express the opposite view that "the premises that generate the paradox are non-causal in character and a genuine logical inconsistency is at stake when a full reconstruction of the paradox is carried out"[9]. To explain the nature of their alternative account it is necessary to consider a third area of controversy: the characterization of what is paradoxical in reasoning in Simpson's paradox.

[9] Wasserman [84], page 338; Bandyopadhyay et al. [5], page 186.

1.2.3 Reasoning in Simpson's paradox

Here is a pattern of probabilistic inference which is roundly invalidated by Simpson's paradox. To distinguish it from the phenomenon discussed by Pearl, I will call it Simpson's error[10]:

Simpson's Error
The following inference is invalid:
Given C, A is more likely if B than if not-B.
Given not-C, A is more likely if B than if not-B.
Therefore, A is more likely if B than if not-B.

Simpson's error may look like a benign argument involving reasoning by cases, but appearances deceive. The comparative relation of being more likely if one set of circumstances obtains rather than another is not an attribute possessed by A. One cannot sum over contrast relations involving probabilities the way one sums over mutually incompatible and jointly exhaustive possibilities in ordinary contexts.
Would we have known that the qualitative inference above is invalid had it not been for Simpson's paradoxical probabilities to show us the phenomenon in plain sight? Probably not. It seems that one has to reflect on the nature of probabilities, in particular on the nature of thinking about ratios, to really grasp why Simpson's error is an error[11].

[10] Caveat: this should not be taken to mean that Simpson committed an error; no more than Pearl's description implies that he dwelt in a state of perpetual surprise.

[11] It would be interesting to know if Simpson's paradox could be recreated in settings in which uncertainty is not represented probabilistically but qualitatively. Is there a Simpson's paradox for possibility measures, for example? One would have to consider the analogue of the Bayesian notion of conditioning for such frameworks; in the case of possibility measures, it is well known that there are several alternatives to choose from. The possibilistic analogue of the paradox would read as follows. Possibilistic Simpson's paradox: A is more possible if B than if not B, both given C and given not C; yet overall A is less possible if B than if not B. The question remains an open topic for research. More generally, the issue of generating analogous drastic counterexamples to alluring modes of inference would be especially important to consider since it would shed light on the connection between a theory of probabilistic and a theory of possibilistic inference.

Here is an example attributed to the mathematician John G. Bennett[12], which helps highlight the common ground between Simpson's paradox and reasoning about ratios:

[12] [5].

Suppose we have two bags of marbles, all of which are either big or small, or red or blue. Suppose that in each bag, the proportion of big marbles that are red is bigger than the proportion of small marbles that are red. Now suppose we pour all the marbles from both bags into one box.
Would we expect the proportion of big marbles in the box that are red to be greater than the proportion of small marbles that are red? Most of us would be surprised to find that our usual expectation is incorrect. The big balls in bag one have a higher ratio of red to blue balls than do the small balls; the same is true about the ratio in bag two. But considering all the balls together, the small balls have a higher ratio of reds to blues than the big balls do. (Bandyopadhyay et al. [5], page 194)

Bennett's useful illustration traces the following mathematical fact. From:

a/(a + b) > c/(c + d)   (1.7)

and

e/(e + f) > g/(g + h)   (1.8)

it does not follow that

(a + e)/(a + b + e + f) > (c + g)/(c + d + g + h)   (1.9)

Does the significance of Simpson's paradox rest fundamentally with Simpson's error, and the fallacious reasoning about ratios that it foreshadows? If this is really what Simpson's paradox amounts to, then it seems we ought to distinguish between different forms of paradox, i.e. between a causal and a non-causal form of Simpson's paradox. The non-causal form would be essentially not a paradox rooted in causal reasoning, but a paradox about the aggregation of statistical information.

There is a long-standing tradition of treating Simpson's paradox as a purely statistical phenomenon concerning the aggregation of various aspects of statistical information. Mathematically valuable papers by Good and Mittal [24], Mittal [56], Zidek [90] and many others cast much light on the structural features of the aggregation phenomenon.

However, I will argue that contrary to the ratio proposal there is no independent non-causal form of Simpson's paradox. I argue in chapter four that the aspects of ratio thinking that seem to create their own puzzle are in fact nothing more than causal thinking in disguise. To capture this notion, I will draw a distinction between causal structure and causal reasoning.
I will argue that even in scenarios in which the formulation of the problem does not address a specific causal structure, it is possible to reason about the problem as though it did have a specific causal structure behind it. In other words, I claim that it is possible to reason about apparently non-causal subject matter in ways which simulate causal thinking. For example, it is possible to view future knowledge as knowledge that arises under an intervention. This is tantamount to treating the model as if it were causal, by breaking the probabilistic dependence between variables as though we were in fact intervening to generate causal conclusions. But if we reason about future knowledge by treating it essentially as a form of knowledge arising by intervention, it isn't surprising that paradox beckons anew. If we follow the causal account of paradox at least, as I suggest we do, this conclusion is expected to hold.

1.3 Open problems

In the previous section I laid out three main areas of controversy concerning the identification, nature, and reasoning in Simpson's paradox. The purpose was to introduce Simpson's paradox, and lay the groundwork for several chapters in this thesis. I will now discuss what I regard to be the open problems raised by Simpson's paradox: the resolution of Simpson's paradox, and the relevance of Simpson's paradox to epistemology and decision theory. The aim of my thesis is to formulate the problems clearly, and to take initial steps towards solving them.

1.3.1 Resolution of Simpson's paradox

Can Simpson's paradox be resolved, and if so, how? To resolve Simpson's paradox must mean more than just to explain it. It must also imply that we are able to accommodate ourselves to the phenomenon. Aspects of successful accommodation include the ability to grasp the practical significance of the paradox when the latter arises in decision-making contexts.
That is to say, coming to terms with the paradox requires knowing how to proceed in situations in which different courses of action appear to be enjoined by considering different levels of association in the data.

The idea that the resolution of Simpson's paradox requires knowing how to proceed was introduced by E.H. Simpson himself. As noted, Simpson found himself pinched between a pair of probabilistically equivalent, causally distinct scenarios. The question he asked was existential: is there an association between X and Y, a real one, not merely an artefact of the data? He found (in effect) that the answer was subordinated to the causal detail of the scenario. The "sensible answer" was that when X and Y were confounded by the existence of a third variable Z, the contingent cancellation of association in the aggregate data was to be ignored. The verdict: X and Y were genuinely associated. Not so in the un-confounded collider scenario. The verdict there was: X and Y were not genuinely associated[13].

[13] To repeat: Simpson himself did not use causal vocabulary. But the form of his argument and his insights are clearly causal in nature. Thus I see no harm in reprising his account in explicitly causal terms.

The modern discussion concerning the practical import of Simpson's paradox took up Simpson's existential question and rephrased it as the question of how best to proceed. In its new guise, it was re-introduced to the modern audience in a landmark article by Lindley and Novick [49], who presented another compelling pair of causally distinct but probabilistically identical scenarios, from which we may derive the point that how to proceed is determined by our views concerning the causal structures in the real world which generate the data[14].

By far the most systematic proposal of how to resolve Simpson's paradox is Pearl's.
Pearl has argued that the resolution involves the application of the so-called backdoor criterion, a well-known graph-based technique for the identification of a sufficient set of variables for the control of confounding bias. He argues that the backdoor criterion is capable of predicting whether the best way to proceed is in accordance with the partitioned data or with the aggregate data. It achieves this feat by purely graphical means.

Pearl's claim to have resolved Simpson's paradox is examined in chapter two of this thesis. Against Pearl, I argue that his proposed solution does not fulfill its promise adequately. I will show that the question of how to proceed cannot be answered with full generality on the basis of the backdoor condition. The backdoor criterion predicts that graph-equivalence entails decision-equivalence; faced with a pair of problems sharing the same causal graph, we ought to make an equivalent decision in both. I will show that, on the contrary, how to proceed requires a finer-grained analysis than the backdoor criterion permits. In other words, the question of how to proceed cuts across the set of predictions obtained on the basis of the backdoor criterion; I will later suggest that resolving "how to proceed" questions involves a fairly unregimented pragmatics of cause and effect.

The problem with Pearl's analysis in my view stems from too close an assimilation of Simpson's paradox to the phenomenon of confounding. Pearl grants that Simpson's paradox may arise in confounded as well as in unconfounded graphs. But he nevertheless seems to equate the solution to Simpson's paradox with the solution to the confounding problem, i.e. with the backdoor criterion resolution.
What is omitted on this picture is the fact that when Simpson's paradox arises in an unconfounded graph, there may still be a problem about how to resolve Simpson's paradox which the backdoor criterion simply does not address.

[14] As with Simpson's original exposition it is equally the case that causal vocabulary is not explicitly used to untangle the scenarios in Lindley and Novick [49]. But as Pearl has noted, it makes much more sense to read the argument in causal terms than it does in terms of their preferred notion of exchangeability.

Thus, although graphical techniques are an immensely valuable formal aid, the case for the resolution of Simpson's paradox is not quite the open-and-shut case that Pearl thinks it is. Much more remains to be done to take into account the contextual nature of many of our causal judgements which ultimately drive the question of how best to proceed. The contextual factor, messy though it may be, is in my view ineliminable.

1.3.2 Evidence

The modern debate about the nature of evidence is heir to Carnap's analysis of the varieties of evidential force in the classic Carnap [9]. Carnap pointed out that the family of confirmation relations by means of which we seek to express aspects of evidential significance contains two distinct concepts of confirmation, confirmation as firmness and confirmation as relevance. Both notions are characterized probabilistically. Confirmation as firmness is a reflection of the conditional probability of h given e. Confirmation as relevance is a comparative evaluation of the magnitudes of the unconditional and the conditional probability of h given e.
Following Carnap, modern approaches to the probabilistic theory of evidence distinguish between absolute and incremental confirmation[15]:

Absolute confirmation: Pr(h ∣ e) > c, for a threshold value c   (1.10)

Incremental confirmation: Pr(h ∣ e) > Pr(h)   (1.11)

[15] See also the discussion in Fitelson [17] in relation to the recent contrastivist approach of construing the evidential relation as a logical environment replete with contrast-arguments of the form e rather than e* is evidence for h rather than h*.

To confirm h absolutely by means of e is to bestow on h a sufficiently high probability conditional on e; whereas e confirms h incrementally if conditioning on e increases h's probability compared to its prior value. Probability theory offers us an equivalent condition for incremental confirmation, on which h is confirmed by e if h is better supported by e than by its negation:

Incremental confirmation: Pr(h ∣ e) > Pr(h ∣ ¬e)   (1.12)

Both the absolute and the incremental construal of evidence are qualitative concepts, addressing the question whether e confirms h, as opposed to assessing how much e confirms h. Notoriously, expanding the theory of probabilistic confirmation to include quantitative assessments is a non-trivial task; whilst a range of inequivalent measures of quantified effects have been proposed, there appears to be no consensus on which is the correct measure of confirmation, and on whether the pluralism in the quantitative family is tolerable.

On the other hand, it has been argued that even the qualitative core of the probabilistic theory of evidence invites serious, if not insuperable, theoretical difficulties.
These include the well-known problem of induced symmetries in the evidential relation[16], the problem of possible divergence between evidential support and probabilistic relevance[17], and the problem of counterintuitive probabilistic inference, discussed below.

[16] See for example the discussion in Eells and Fitelson [15].
[17] See Achinstein [1].

At first sight, it might seem that Simpson's paradox fits perfectly with a general kind of difficulty faced by the qualitative incremental construal of confirmation. The general problem is that the mathematical consistency of certain sets of statements about probabilities appears to license corresponding counterintuitive statements about the force of evidence. Arguably, these represent an undesirable yet inescapable inheritance of the "vagaries" pertaining to the probabilistic framework in which uncertainty is represented. For example, it was noted by Salmon [68] that the individual consistency of (1.13), (1.14), (1.15), and (1.16) implies a corresponding set of intuitively implausible evidential relations:

Pr(h ∣ e) > Pr(h ∣ ¬e); Pr(h ∣ e*) > Pr(h ∣ ¬e*); Pr(h ∣ e ∧ e*) < Pr(h ∣ ¬e ∨ ¬e*)   (1.13)

Pr(h ∣ e) > Pr(h ∣ ¬e); Pr(h ∣ e*) > Pr(h ∣ ¬e*); Pr(h ∣ e ∧ e*) < Pr(h ∣ ¬e ∧ ¬e*)   (1.14)

Pr(h ∣ e) > Pr(h ∣ ¬e); Pr(h* ∣ e) > Pr(h* ∣ ¬e); Pr(h ∧ h* ∣ e) < Pr(¬h ∨ ¬h* ∣ ¬e)   (1.15)

Pr(h ∣ e) > Pr(h ∣ ¬e); Pr(h* ∣ e) > Pr(h* ∣ ¬e); Pr(h ∨ h* ∣ e) < Pr(¬h ∧ ¬h* ∣ ¬e)   (1.16)

According to (1.13) and (1.14), it is possible to disconfirm a hypothesis simply by conjoining or disjoining a pair of individually confirming pieces of evidence. This seems to stretch credibility to breaking point. How can conjoining or disjoining individually supporting bits of evidence fail to produce a confirming piece of evidence? Does our ordinary concept of evidence bear out this prediction, or does it resolutely diverge from it? Yet (1.13) and (1.14) are probabilistically consistent, being satisfiable in a simple probabilistic model.
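The satisfiability claim can be checked directly. The following sketch builds one such simple model (the numbers are hypothetical): e and f, with f standing in for e*, are independent, each confirms h on its own, and yet their conjunction disconfirms h in the manner of (1.13) and (1.14):

```python
from itertools import product

# A hypothetical joint model over (h, e, f): e and f are independent with
# Pr(e) = Pr(f) = 0.4, and h is likely just in case exactly one of e, f
# obtains.
p_h_given = {(1, 1): 0.02, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.05}
joint = {}
for h, e, f in product((0, 1), repeat=3):
    pe = 0.4 if e else 0.6
    pf = 0.4 if f else 0.6
    ph = p_h_given[(e, f)] if h else 1 - p_h_given[(e, f)]
    joint[(h, e, f)] = pe * pf * ph

def pr_h(cond):
    """Pr(h = 1 | cond), where cond is a predicate on (e, f)."""
    top = sum(p for (h, e, f), p in joint.items() if h and cond(e, f))
    bot = sum(p for (_, e, f), p in joint.items() if cond(e, f))
    return top / bot

# Each piece of evidence confirms h on its own ...
assert pr_h(lambda e, f: e == 1) > pr_h(lambda e, f: e == 0)
assert pr_h(lambda e, f: f == 1) > pr_h(lambda e, f: f == 0)
# ... yet their conjunction disconfirms h, as in (1.13) and (1.14):
assert pr_h(lambda e, f: e and f) < pr_h(lambda e, f: not (e and f))
assert pr_h(lambda e, f: e and f) < pr_h(lambda e, f: not e and not f)
```

The trick is that h is likely precisely when exactly one of the two pieces of evidence obtains, so each is individually good news while their conjunction is bad news.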
Moreover, according to (1.15) and (1.16) it is possible to disconfirm a conjunction or a disjunction of hypotheses, by bringing to bear a single piece of evidence which nevertheless confirms each conjunct (disjunct) separately. This seems at variance with our ordinary conception of evidence as well. How can a single piece of evidence fail to support the conjunction or disjunction of hypotheses that it individually and separately confirms? Yet again (1.15) and (1.16) are consistent and easily satisfiable in a simple model.

Inherited probabilistic behaviour of this kind, more or less aptly described as tracing "vagaries" in the probabilistic notion of confirmation, seems to surface in Simpson's paradox as well. At first sight, it would appear that the reversal inequalities characterizing Simpson's paradox contribute another exemplar in the same class of impenetrable, unnatural-sounding inferences about the force of evidence. For the consistency of the probabilistic (1.17) seems to imply the evidential set of judgments (1.18):

Pr(h ∣ e) > Pr(h ∣ ¬e); Pr(h ∣ e, e*) < Pr(h ∣ ¬e, e*); Pr(h ∣ e, ¬e*) < Pr(h ∣ ¬e, ¬e*)   (1.17)

e confirms h, yet disconfirms h no matter what one learns about the partition {e*, ¬e*}   (1.18)

As Morton [57] argues, the problem of shifty evidential support expressed in (1.18) has epistemological implications. For suppose that we allow that belief can be justified by Simpson's paradoxical evidence. Then it seems that we have to allow that belief could in principle be currently justified, even though no matter what one goes on to learn about the partition, one's belief would cease to be justified in the future. Yet it seems a dictum of rationality that one should not believe that which one has reason to think would be defeated by future evidence, no matter how the evidence comes in. Morton has dubbed the phenomenon deviant belief.
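A simple model realizing (1.17) is likewise easy to construct. In the following sketch (hypothetical numbers), e disconfirms h within each cell of the partition, yet confirms h overall, because e is strongly associated with the cell in which h is likely anyway:

```python
# Hypothetical probabilities realizing (1.17): within each cell of the
# partition {e*, not-e*}, e lowers the probability of h, yet overall e
# raises it.
p_h = {  # Pr(h | e-value, e*-value)
    (1, 1): 0.2, (0, 1): 0.3,   # given e*:     e lowers Pr(h)
    (1, 0): 0.6, (0, 0): 0.7,   # given not-e*: e lowers Pr(h)
}
p_estar = {1: 0.1, 0: 0.9}      # Pr(e* | e-value)

def pr_h_given_e(e):
    """Pr(h | e-value), mixing the partition cells by Pr(e* | e)."""
    return sum(p_h[(e, z)] * (p_estar[e] if z else 1 - p_estar[e])
               for z in (0, 1))

# Within both cells e disconfirms h ...
assert p_h[(1, 1)] < p_h[(0, 1)] and p_h[(1, 0)] < p_h[(0, 0)]
# ... yet unconditionally e confirms h: 0.56 > 0.34.
assert pr_h_given_e(1) > pr_h_given_e(0)
```

Here e makes the favourable cell not-e* very probable (0.9 against 0.1), and that association dominates the cell-internal disadvantage — the same mixture effect as in the ratio examples above.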
Deviant belief isn't merely defeasible belief; it is according to Morton systematically defeasible belief, defeated no matter what one learns about the further conditioning factor. The oddity of this construct is that we may have to allow both that a belief is deviant, and that it is justified.

In this thesis, I argue that although there is a problem about the nature of evidence in light of Simpson's paradox, it is not the problem identified by Morton. The problem of shifty evidence is relatively easy to solve, once we have distinguished, as I will argue later on, between reasoning about evidence in a causal sense, and reasoning about it in a news value sense. My view of evidence in Simpson's paradox is that it is split between two distinct notions of evidence, each serving a different role. The alternative view attributable to Morton is that Simpson's paradoxical evidence is unreliable. The mark of unreliability in Morton's view is that it fails to "counterfactually project". The probabilistic correlation between e and h is said to project counterfactually if and only if had e been false, h would have been less likely to obtain (if the correlation is positive); or, if and only if had e been false, h would have been more likely to obtain (if the correlation is negative).

Against Morton, I will argue in chapter five that the distinction between causal and news value senses of evidence solves the problem; both in the news value sense and in the causal sense, Simpson's paradoxical evidence is fully robust under counterfactual supposition. Moreover, in chapter five I connect Morton's puzzle with Newcomb's puzzle, and show that whilst an analogous worry may be generated for Newcomb's puzzle, it can be dealt with in the same way by distinguishing between two senses of evidential force.

Nevertheless, solving Morton's puzzle still leaves us with a serious and deep problem in the characterization of evidence in light of Simpson's paradox.
The problem comes out when we consider contexts in which causal knowledge is represented, not merely probabilistic knowledge of the Simpson's paradoxical inequalities. According to the distinction I draw here, the causal representation of the force of evidence requires us to take into consideration causal knowledge of the structural relations in the "real world" which generate the Simpson's paradoxical data. But the news value sense ignores causal knowledge, and goes by overall probabilistic relevance. Both ways of thinking about the force of evidence are required to give a full account of the way evidence works in Simpson's paradoxical scenarios with specific causal knowledge.

I will present and discuss the problem for evidence arising from causal knowledge in chapter three. It is a fundamental tenet of Bayesian epistemology that learning new evidence must be accommodated by conditioning on the evidence, relative to a probability function which expresses one's total knowledge. This condition is known in the literature as the total evidence requirement. But the problem is that the total evidence requirement doesn't seem to be able to accommodate causal knowledge. It is well known that causal knowledge cannot be encoded purely probabilistically by a probability measure, since probabilistic information underdetermines causal structure. One cannot infer a unique causal structure (a description of processes in the world generating the data) from a probability function, since a single probability function may be compatible with many causally inequivalent process representations. The converse aspect of this problem is that causation cannot be encoded by means of conditional probabilities alone.
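This underdetermination can be illustrated with a minimal sketch: a chain X → Z → Y and a common-cause fork X ← Z → Y are causally inequivalent, yet can generate the very same joint distribution. The sketch below (hypothetical parameters) builds a joint distribution from a chain and then verifies that it also factorizes exactly as a fork:

```python
from itertools import product

# Hypothetical chain model X -> Z -> Y (all binary variables).
p_x1 = 0.4
p_z1_x = {0: 0.2, 1: 0.7}   # Pr(Z=1 | X)
p_y1_z = {0: 0.1, 1: 0.8}   # Pr(Y=1 | Z)

def chain(x, z, y):
    px = p_x1 if x else 1 - p_x1
    pz = p_z1_x[x] if z else 1 - p_z1_x[x]
    py = p_y1_z[z] if y else 1 - p_y1_z[z]
    return px * pz * py

joint = {(x, z, y): chain(x, z, y) for x, z, y in product((0, 1), repeat=3)}

# Read off the parameters of a common-cause fork X <- Z -> Y from the
# very same joint distribution ...
p_z = {z: sum(p for (x, zz, y), p in joint.items() if zz == z)
       for z in (0, 1)}
p_x_z = {(x, z): sum(p for (xx, zz, y), p in joint.items()
                     if xx == x and zz == z) / p_z[z]
         for x in (0, 1) for z in (0, 1)}
p_y_z = {(y, z): sum(p for (x, zz, yy), p in joint.items()
                     if yy == y and zz == z) / p_z[z]
         for y in (0, 1) for z in (0, 1)}

# ... and check that the fork factorization reproduces the joint exactly:
for x, z, y in product((0, 1), repeat=3):
    fork = p_z[z] * p_x_z[(x, z)] * p_y_z[(y, z)]
    assert abs(fork - joint[(x, z, y)]) < 1e-12
```

Both structures fit the data perfectly, so the probability function alone cannot single out the process representation — which is the point at issue for the total evidence requirement.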
As I show in chapter three, this has repercussions for the notion of evidence. Whilst Bayesian epistemology employs a single concept of evidence, I suggest that in light of Simpson's paradox, the single concept ought to split into two distinct concepts, causal and news-value, one tracking probabilities under an intervention, and the other tracking probabilities under conditioning. The formal distinction between intervening and conditioning is explained in more detail in the appendix.

Furthermore, in chapter seven I deal with an interesting puzzle presented in Aumann et al. [4], which on the face of it looks highly similar to the epistemic arguments proposed by Morton. As defined by Morton, deviant belief is rational but systematically defeasible by better knowledge. The notion of belief compatible with the argument presented in Aumann et al. [4] is deviant in a deeper sense: it is rational belief which in future would have to be disbelieved no matter what the agent were to go on to learn. I argue that their analysis is wrong, and hence that there is no reason to suppose that rational belief can undergo such radical shifts in light of expected knowledge. Gathering up these conclusions, my view is that no case has yet been successfully made for expanding epistemology to include "rational deviance" either of justification or of belief. On the other hand, there is a serious problem that the Bayesian analysis of evidence faces, pertaining to the force of evidence when causal knowledge is taken into account.

1.3.3 Decision theory

The significance and import of Simpson's paradox for decision theory have not passed unnoticed. Several attempts have been made to show that Simpson's paradox invalidates the famous Sure thing principle in Savage's decision theory, for example in Blyth [8].
In other places, for example in Gibbard and Harper [22], it is argued that we need to distinguish between an epistemic and a causal version of the Sure thing principle, on the basis of models with the structure of Simpson's paradox. As I will argue, the implications for the Sure thing principle posed by Simpson's paradox are considerably more far-reaching than previously supposed. Furthermore, there is the question of the isomorphism between Simpson's paradox and Newcomb-type problems, which itself leads to the question of what the isomorphism entails for decision theory. Let us address the isomorphism question first.

It has been remarked, for example in Wagner [82], that Newcomb's problem is probabilistically equivalent to Simpson's paradox, in the form in which the association obtaining between two variables of interest is cancelled by conditioning on a third factor[18]. Indeed, whether or not commentators explicitly identify the structure of probabilities in Newcomb-type problems as Simpson's paradoxical, the observation is certainly true. Simpson's paradox qua structure of probabilities is thus a latent feature in the decision-theoretic analysis. But more importantly, it has been proposed that Newcomb-type problems are causally equivalent to Simpson's paradox.

[18] See also Morton [57], who takes a similar line to Wagner's.

Several causal structures have been proposed for Newcomb's paradox. The simplest and perhaps the most natural is in terms of a conjunctive fork X ← Z → Y, which interprets the conditional independence between X and Y given Z as arising from the fact that Z is the sole common cause of both X and Y. See for example Meek and Glymour [54], Eells and Sober [16], and Joyce [38]. The conjunctive fork model provides an outstandingly good fit for the famous Fisher-Jeffrey smoking problem, in which conditioning on the possession of the gene renders smoking and cancer independent.
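The screening-off behaviour of a conjunctive fork can be verified directly. In this sketch (hypothetical numbers), Z is the common cause of X and Y; the two effects are correlated in the aggregate, yet independent conditional on each value of Z:

```python
# Conjunctive fork X <- Z -> Y with hypothetical parameters: Z is the
# sole common cause of X and Y.
p_z1 = 0.3
p_x1 = {1: 0.9, 0: 0.2}   # Pr(X=1 | Z=z)
p_y1 = {1: 0.8, 0: 0.1}   # Pr(Y=1 | Z=z)

def joint(x, y, z):
    pz = p_z1 if z else 1 - p_z1
    px = p_x1[z] if x else 1 - p_x1[z]
    py = p_y1[z] if y else 1 - p_y1[z]
    return pz * px * py

def pr(pred):
    """Probability of the event picked out by pred(x, y, z)."""
    return sum(joint(x, y, z)
               for x in (0, 1) for y in (0, 1) for z in (0, 1)
               if pred(x, y, z))

# Marginally, X and Y are dependent: Pr(X=1, Y=1) != Pr(X=1) Pr(Y=1).
assert abs(pr(lambda x, y, z: x and y)
           - pr(lambda x, y, z: x) * pr(lambda x, y, z: y)) > 0.05

# Conditioning on the common cause screens X off from Y in each stratum.
for z0 in (0, 1):
    pz = pr(lambda x, y, z: z == z0)
    pxy = pr(lambda x, y, z: x and y and z == z0) / pz
    px = pr(lambda x, y, z: x and z == z0) / pz
    py = pr(lambda x, y, z: y and z == z0) / pz
    assert abs(pxy - px * py) < 1e-12
```

This is the pattern the conjunctive fork reading attributes to the Fisher-Jeffrey scenario, with the gene in the role of Z and smoking and cancer in the roles of X and Y.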
An alternative causal structure, in terms of an interactive fork, is considered in Eells and Sober [16]. In an interactive fork formation Z is d-separated from X and Y by an intermediate node Z1; i.e. conditioning on Z1 renders X and Y independent of each other and of the common ancestor Z (as predicted by the Markov condition). Note that in Eells and Sober [16] and in the subsequent discussion in Joyce [38] the causal structure of Newcomb's paradox is further expanded to make it suitable for the analysis of rational deliberation as a dynamic process. Eells and Sober argue that the Bayesian conception of rationality independently requires the addition of a reflective evaluation node, screening off (d-separating) the symptomatic action (e.g. smoking) from the original cause (e.g. the gene). Eells and Sober's strategy of tacking epistemically enriched structure onto the causal skeleton of the decision problem is motivated by the need to defend epistemic decision theory from the charge that it cannot produce the intuitively correct solution to Newcomb's paradox; and justified by an extrinsic philosophical conception of what kind of causal structure the possibility of rational deliberation is inherently predicated upon.

Given these two aspects of causal and probabilistic equivalence between Newcomb-type problems and Simpson's paradox, it seems not too outrageous to claim that au fond, Newcomb's paradox just is Simpson's paradox flying under a different banner, and that consequently, to the extent that Newcomb's paradox serves to differentiate and motivate causal decision theory as against evidential decision theory, the entire debate is built around a (largely) unacknowledged instance of Simpson's paradox. This (in particular the latter part of the claim) is an ambitious agenda. Someone might deny the claim to centrality by denying that Newcomb's paradox engenders a genuine prescriptive difference between causal and evidential versions of decision theory.
Ratificationist evidential theory, for example, seeks to demonstrate that the intuitively correct deliverance in Newcomb's paradox is not unique to causal decision theory, but can be generated autonomously on improved evidentialist principles as well (see Richard Jeffrey's impressive and resourceful defence of evidential decision theory in Jeffrey [34], Jeffrey [35]). But conversely, if one thinks ratificationism fails to contribute to evidential decision theory the coherent normative basis on which to fulfill the criterion of adequate prescriptivism, then it is hard to see how the claim of centrality can be resisted. Surely the causal and probabilistic equivalence between Newcomb's paradox and Simpson's paradox just shows that decision theory has been in denial about the real sources of its problems.

And yet surprisingly little in the way of an unambiguous and fully worked out response has come out of the structural analysis of Newcomb's paradox as a case with the causal and probabilistic structure of Simpson's paradox. It might be useful here to take a quick census of opinion, even among the handful of references undertaking a causal analysis of the decision-making context. According to Meek and Glymour [54], the causal analysis of Newcomb's paradox provides an opportunity for reconciliation between evidential and causal decision theory, to the extent that both theories may be shown to obey the same normative principles, but to differ in how they construe the concept of a decision. Causal decision theory construes decisions as interventions; by contrast, evidential decision theory construes them as observations. Meek and Glymour [54] decline the opportunity to adjudicate between these distinct conceptions, indicating that different contexts of decision-making might in principle call for either kind of decision.
For Wagner [82], the Simpson’s paradoxical structure of probabilities in Newcomb’s paradox and affiliated puzzles shows that one must distinguish decision making under risk from decision making under uncertainty. His analysis yields two-boxing as the rational act in Newcomb’s problem. For Morton [57], to describe Simpson’s paradox, Newcomb’s problem, and the Fisher-Jeffrey gene scenario is to note in each case a set of correlations with different counterfactual profiles; some project counterfactually, i.e. are robust under counterfactual supposition, and some are not. Morton presents an epistemic version of Newcomb’s paradox to underscore his points. Finally, for Pearl [62], whose main interest is admittedly not decision theory, expected utilities should be calculated employing causal probabilities, or probabilities on an intervention. According to Pearl, evidential decision theory is built on fallacious reasoning which improperly treats an action, qua item of analysis in the representation of deliberation, as capable of providing evidence about its own causes. Implicitly, the relevance of Simpson’s paradox to decision theory is to highlight such impropriety (see Pearl [62], pages 108-110).

In this thesis I will not attempt to adjudicate these views, much as they appear to be in a state of disarray. Instead I will make the case that decision theory has far underestimated the centrality of Simpson’s paradox in a specific way, to do with the applicability of the Sure thing principle in decision theory.

Causal and evidential theory can both be regarded as specific implementations of the framework presented by Savage in his classic work, Savage [72]. The central result in Savage’s theory is the famous representation theorem, connecting the concepts of rational preference and expected utility maximization. Savage gave an axiomatization of rational preference, and showed that an agent whose ranking of options obeys his axioms can be represented as if she were maximizing expected utility according to a unique probability function pr and an almost unique utility function u. It became immediately apparent, however, that Savage’s theory suffered a serious drawback, which is that the application of his axiomatic principles of rationality is not partition-invariant.

Lack of partition-invariance is a manifold problem. First, it means that the determination of the expected utility maximizing act is sensitive to the choice of partition; dividing up the logical space in different ways may produce different recommendations at variance with each other. Savage appeared to fix this problem by stipulating as a condition of application that the selected partition must be independent of the agent’s actions. But as Gibbard and Harper pointed out in a landmark article, it is possible to understand independence in a causal sense or in an epistemic sense. They proposed replacing the sure thing principle with two versions, one requiring causal independence, and one requiring epistemic independence. In cases in which the prescriptions derivable relative to each of these conflict, Gibbard and Harper proposed that we should go with causal independence.

I will argue in chapter six that whilst this appears to deal with a simple form of Simpson’s paradoxical reversals, it does not deal with a generalized form of Simpson’s paradox, in which the reversal in one partition is reversed relative to a further partition, and then reversed again, etc. Indefinitely reversed reversals are certainly conceptually and physically possible. A causal skeleton for generating manifold iterated Simpson’s paradox is presented for example in Pearl [64], who labels the phenomenon in memorable terms as a multi-stage Simpson’s paradoxical machine.
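To make the multi-stage pattern concrete, here is a minimal computational sketch of a two-stage “Simpson’s paradoxical machine”. The counts are invented for illustration (they are not taken from Pearl [64]): the first stage reuses the classic gender example, and a hypothetical second factor W is layered beneath it. Aggregated, treatment looks beneficial; partitioned by gender, the comparison reverses; partitioned further by W, it reverses back again.

```python
from fractions import Fraction

# Hypothetical counts (recovered, total); constructed for illustration,
# not data from Pearl [64]. Keys: (treated?, gender, w).
counts = {
    (True,  "m", 1): (9, 10),  (True,  "m", 0): (9, 20),
    (False, "m", 1): (7, 8),   (False, "m", 0): (0, 2),
    (True,  "f", 1): (1, 2),   (True,  "f", 0): (1, 8),
    (False, "f", 1): (9, 20),  (False, "f", 0): (0, 10),
}

def rate(treated, pred=lambda g, w: True):
    """Recovery rate among the treated/untreated units satisfying pred."""
    rec = sum(r for (t, g, w), (r, n) in counts.items() if t == treated and pred(g, w))
    tot = sum(n for (t, g, w), (r, n) in counts.items() if t == treated and pred(g, w))
    return Fraction(rec, tot)

# Stage 0: aggregated, treatment looks beneficial (1/2 > 2/5).
assert rate(True) > rate(False)
# Stage 1: within each gender, the comparison reverses.
for g in ("m", "f"):
    assert rate(True, lambda g2, w: g2 == g) < rate(False, lambda g2, w: g2 == g)
# Stage 2: within each gender-by-W cell, it reverses back again.
for g in ("m", "f"):
    for w in (0, 1):
        assert rate(True, lambda g2, w2: (g2, w2) == (g, w)) > \
               rate(False, lambda g2, w2: (g2, w2) == (g, w))
```

Nothing prevents iterating the construction with further factors beneath W; the reversals can in principle be stacked indefinitely.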
When the causal structure of the decision problem is that of a multi-stage paradox, there is little use distinguishing between forms of sure thing principle reasoning. The problem is not that we need to distinguish epistemic from causal applications. The problem is that applications of the same causal sure thing principle can yield different prescriptions depending on which (causally act-independent) partition it is applied to. In turn this means that to be sure we have correctly applied Savage’s theory to a decision problem, we need to have substantive causal knowledge which allows us to say which among the various partitions (and their products) is the relevant one.

1.4 Thesis outline

The remaining chapters are presented in the following order. In chapter two I address Pearl’s claim that Simpson’s paradox is resolved by the application of graph theoretical techniques. In chapter three I discuss Simpson’s paradox in the context of the probabilistic Bayesian theory of evidence. In chapter four I argue that the alternative non-causal view of Simpson’s paradox involves causal reasoning in disguise. In chapter five I address Morton’s argument about counterfactual projection of evidential force. The next two chapters are dedicated to decision theory and the implications of Simpson’s paradox for the Sure thing principle. The thesis concludes with a brief discussion of future work in light of the issues broached so far.

Chapter 2

Puzzle or Paradox: Has Simpson’s Paradox Been Solved?

2.1 Introduction

Has Simpson’s paradox been resolved? Pearl [64] affirms that it has, contending that the successful resolution involves the application of the so-called backdoor criterion, a well-known graph-based technique for the identification of a sufficient set of variables for the control of confounding bias.
As he writes,

Not only do we know which causal structures would support Simpson’s reversals, we also know which structure places the correct answer with the aggregated data or with the disaggregated data. Moreover, the criterion for predicting where the correct answer lies (and, accordingly, where human consensus resides) turns out to be rather insensitive to temporal information [...]. It involves a simple graphical condition called “backdoor” (Pearl 1993) which traces paths in the causal diagram and assures that all spurious paths [from one variable to another] are intercepted by the third variable. [...] armed with these criteria, we can safely proclaim Simpson’s paradox “resolved”. (Pearl [64], page 9)

Commentators following Pearl tend to line up behind the idea that the sting has been conclusively drawn out of Simpson’s paradox. For example, Hernán et al. [27] and Arah [2] write, respectively, that:

The apparent paradox originally described by Simpson is the result of disregarding the causal structure of the research problem. In fact, any hint of a paradox disappears when the causal structure is made explicit. (Hernán et al. [27], page 780)

Although Simpson’s and related paradoxes reveal the perils of using statistical criteria to guide causal analysis, they hold neither the explanations of the phenomenon they purport to depict nor the pointers on how to avoid them. The explanations and solutions lie in causal reasoning which relies on background knowledge, not on statistical criteria. It is high time we stopped treating misinterpreted signs and symptoms (“paradoxes”), and got on with the business of handling the disease (“causality”). (Arah [2], page 5)

Other commentators appear to go even further, in suggesting that the paradox is not merely resolved but debunked by causal analysis. Wasserman [84] writes that “a real Simpson’s paradox cannot happen”; for example, “there cannot be a treatment that is beneficial for men and women but harmful overall” (Wasserman [84], page 340). Wasserman’s skepticism concerning the very possibility of a genuine, causal instance of Simpson’s paradox echoes a passage in Pearl [62] (chapter 6, page 174, dealing with “a tale of a non-paradox”). In a similar vein, Pearl denies that the notorious drug-treatment case can be a genuine causal instance of Simpson’s paradox, i.e. an instance in which conditional probabilistic inequalities reflect underlying causation. Note however that Pearl would not endorse Wasserman’s blanket skepticism concerning the possibility of a causal instance of Simpson’s paradox. As we know, the mark of Simpson’s paradox is a pattern of probabilistic reversal of inequalities. But the main point raised by Pearl is that causation doesn’t go by conditional probabilities, but by probabilities under intervention. However, these can be the same.

If conditional probabilities and probabilities under an intervention happen to be the same, and if there is a pattern of reversal relative to a third factor, then we will automatically have both an associational and a causal instance of Simpson’s paradox all rolled into one. Of course, that is not the case if conditional and intervention probabilities diverge. Then we may obtain an “associational” Simpson’s paradox, without obtaining a causal form into the bargain (as, for example, in the treatment case above). The point relates back to the backdoor criterion, since the backdoor criterion is designed to help us calculate causal probabilities, i.e. probabilities under an intervention.
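The divergence between conditioning and intervening can be made vivid with a small numerical sketch. The distribution below is hypothetical, constructed for illustration rather than taken from Pearl: a confounder Z (gender, say) influences both treatment uptake X and recovery Y, producing an associational reversal; the backdoor-adjusted, interventional probabilities show no reversal at all.

```python
# Hypothetical confounded model: Z causes both X and Y, and X also affects Y.
# All numbers are invented for illustration.
p_z = {"m": 0.5, "f": 0.5}                   # pr(Z)
p_x_given_z = {"m": 0.75, "f": 0.25}         # pr(X=1 | Z): uptake depends on Z
p_y_given_xz = {                             # pr(Y=1 | X, Z)
    (1, "m"): 0.6, (0, "m"): 0.7,
    (1, "f"): 0.2, (0, "f"): 0.3,
}

def pr_y_given_x(x):
    """Conditional probability pr(Y=1 | X=x), computed from the joint."""
    def p_x_z(z):
        px = p_x_given_z[z]
        return px if x == 1 else 1 - px
    num = sum(p_y_given_xz[(x, z)] * p_x_z(z) * p_z[z] for z in p_z)
    den = sum(p_x_z(z) * p_z[z] for z in p_z)
    return num / den

def pr_y_do_x(x):
    """Interventional pr(Y=1 | do(X=x)) via backdoor adjustment over Z."""
    return sum(p_y_given_xz[(x, z)] * p_z[z] for z in p_z)

# Associational Simpson's paradox: X looks beneficial overall (0.5 > 0.4)...
assert pr_y_given_x(1) > pr_y_given_x(0)
# ...but harmful within each stratum of Z:
assert all(p_y_given_xz[(1, z)] < p_y_given_xz[(0, z)] for z in p_z)
# Under an intervention the reversal disappears: X is simply harmful (0.4 < 0.5).
assert pr_y_do_x(1) < pr_y_do_x(0)
```

Here the conditional and interventional probabilities diverge, so the paradox is associational only; when they coincide (no open backdoor path), the same pattern of reversal would be causal as well.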
See pages 180-182 in Pearl [62] for a simple formal proof, in Theorem 6.1.1, that probabilities under an intervention cannot satisfy the Simpson’s paradoxical inequalities if a certain kind of “backdoor” causal assumption is made.

One could argue that Wasserman’s misplaced rejection of the very possibility of a “real” causal instance of Simpson’s paradox is based on a too narrow range of examples. We may agree that (as Pearl puts it) no “miracle” drug can exist that is simultaneously beneficial to men and women and yet harmful to the population as a whole. On the other hand, causal insight does not rule out the possibility (as we shall see later on in this chapter) that a commonplace drug exists that is detrimental in the high and low blood pressure groups individually in a population divided by blood pressure type, and yet is simultaneously beneficial to the population as a whole. Examples of this latter kind represent causal instances of Simpson’s paradox. We should not lose track of the variety of causal structures that “Simpson’s paradox” qua pattern of reversed inequalities can apply to, which can encompass both “real” (causal) and “apparent” (non-causal, purely associational) instances of paradox. (In a later chapter, I argue that the pattern of reversed inequalities is not in and of itself paradoxical, and that there is only one (causal) form of paradox. At this stage in the argument, the distinction between “real” and “apparent” instances of paradox is a mere façon de parler borrowed from the texts quoted above.)

But what about the more modest proposal that the paradox has been resolved? Do we have the general means to relegate Simpson’s paradox to the status of a simple puzzle? Pearl [64] (page 9) sets out three analytical criteria for resolving Simpson’s paradox. First, a resolution of the paradox must explain why the phenomenon is considered to be paradoxical, i.e. why it is that it elicits educated disbelief or invokes legitimate surprise. Second, it must provide an unambiguous set of conditions for identifying the class of scenarios which permit Simpson’s paradox. Third, and vitally, it must methodically provide the correct answer to the question of how to proceed when Simpson’s paradox arises in a decision theoretical context.

Against Pearl, I argue that his proposed solution does not fulfill the key third requirement of adequacy. I will show that the question of how to proceed cannot be answered with full generality on the basis of the backdoor condition. The backdoor criterion predicts that graph-equivalence entails decision-equivalence; faced with a pair of problems sharing the same causal graph, we ought to make an equivalent decision in both. I will show that on the contrary, how to proceed requires a finer-grained analysis than the backdoor criterion permits. The strategy that I will adopt to carry out the argument is a fairly familiar one. I will fit different causal stories to an individual directed acyclic graph, arguing that the uniquely correct answer predicted on the basis of the backdoor criterion fits only one causal story, not both. This will show that the philosophical question of how to proceed cuts across the set of predictions obtainable on the basis of the backdoor criterion; I will later on suggest that resolving “how to proceed” questions involves a fairly un-regimented pragmatics of causal effect.

2.2 How to proceed in Simpson’s paradox

How do questions about “how to proceed” arise in Simpson’s paradox? I will distinguish here between a how to proceed query under conditions of partitional ignorance and a how to proceed query under conditions of partitional knowledge.
In one case we ask: what is the rational thing to do when one has no specific information about the partitional factors in the Simpson’s paradoxical problem? In the other we ask: what is the rational thing to do when the relevant partitional information is known? Typically, how to proceed questions in Simpson’s paradox are framed under conditions of partitional ignorance; it is part of the description of the problem that the agent must choose which act to perform without knowing which partitional alternative obtains.

To illustrate, consider a pair of thought experiments from Lindley and Novick [49]. A rational decision-maker must decide how to proceed in two probabilistically equivalent scenarios. In one context the decision maker has to settle the question of whether or not to administer treatment to a patient of unknown gender, based on the data in Table 2.1. In the second context the agent has to decide whether to plant a white or a black cultivar in next year’s garden scheme, without being able to anticipate how tall it will grow at maturity, based on the data in Table 2.2. Suppose we also assume compatible utilities, for example, that the agent prefers that the patient recovers and prefers that the yield in his garden scheme is high.

It seems reasonable to suppose that if the medical and horticultural decisions are sensitive to probabilities alone (given basically the same utility function over outcomes), then the two decisions ought to align with each other. If nothing else intervenes in the equation except for probabilistic information (the same in both cases) and utilities (also equivalent) then it doesn’t make sense to suppose that the decisions themselves would fail to align.
On this reasoning, it ought to be the case on a priori grounds that it is rational to treat in one scenario if and only if it is rational to plant a white cultivar in the other.

                M          ¬M         Recovered (%)
                R    ¬R    R    ¬R    M     ¬M    Overall
  Treated       18   12    2    8     60    20    50
  Untreated     7    3     9    21    70    30    40

Table 2.1: Medical case.

                Tall       Short      Yield (%)
                Y    ¬Y    Y    ¬Y    Tall  Short Overall
  White         18   12    2    8     60    20    50
  Black         7    3     9    21    70    30    40

Table 2.2: Botanical case.

Note that we have not actually specified whether or not we think that the agent should act in accordance with the overall association, or the partition-based association. In fact, as Simpson himself noted in his celebrated paper, we sometimes think it is rational to follow one prescription, and sometimes the other. As Hernán et al. [27] (page 780) put it, Simpson drew attention to the fact that

[...] even if the conditional A-B association measure is homogeneous within levels of C, the sensible answer is sometimes the conditional association measure and other times the marginal one.

But we don’t need to choose between measures here to address the two-scenarios argument. We know that the agent’s decision has to be informed by one or other type of association. So if the decision rests on probabilities alone, and not on other features such as individual causal structure, then the decisions must align, since the data is the same in both cases. Conversely, if the decisions are imperfectly aligned, then we have shown (1) that the rational act is determined in principle by either type of association, as Simpson himself indicated, but more importantly also (2) that there is no way to predict purely on the basis of the common probabilistic features of the problem what the unique rational thing to do is in a problem of this type, i.e. that the nature of the uniquely rational act is incompletely determined by the probabilistic parameters of the problem.

And yet, as Lindley and Novick [49] themselves indicate, the correct individual decisions in the treatment and botanical cases are (a) not to treat the patient of unknown gender, but (b) to select a white cultivar of the plant of unknown growth habit. What their argument successfully shows is that decisions how to proceed are more fine-grained than probabilistic equivalence. Despite the fact that the data is the same in both scenarios, in the botanical case we reason according to the overall probabilistic relevance of colour on yield, whilst in the treatment case, we reason in accordance with the conditional relevance of treatment on recovery mediated by gender.

How can we explain this significant divergence between what seems justified in one case and another, given that the data is the same? Lindley and Novick [49] do not delve into this question. In my view, the difference in the nature of justification has to lie in the causal detail of the story. In the treatment case, we may assume that gender influences treatment in some subtle way in the data set, but clearly treatment assignment does not causally influence gender. In the botanical case, it seems likely that colour and height are associated with each other owing to possible ancestral causal factors, for example a genetic component that explains both growth pattern and colour, and therefore yield. If we act to withhold treatment, we do not affect gender, which means that it is no more likely that the untreated person is male or female; but if we act to select a white cultivar, it is likely that the plant selected by us is endowed with a genetic structure that will cause it to grow tall, maximizing the expected yield. Indeed, if we express the justification in terms of modes of reasoning, we must highlight the difference between abductive and causal reasoning. In the botanical case, the reasoning is partly abductive (and partly causal). The point is not that the existence of a genetic causal factor is the best possible explanation for colour.
Rather, it’s that the inference from plant is white to plant is probably tall traces the abductive inference plant is white, so probably plant has gene Z, down to the causal step so, on the basis of having gene Z, plant is probably tall. By contrast, in the treatment case, there is a single causal inference from cause to effect. No abductive step is required, which accounts for the resiliency observed earlier in the face of future knowledge. When we discover (for example) that the plant is in fact short, the earlier abductive inference that was supposed to establish tallness is cancelled. This means that the reasoning (though correct) is riskier, in the sense that it is more susceptible to default in the face of contradictory information. In neither story is the causal component of the reasoning risky in the same sense. We may safely assume that no possible future information about the partition can overturn the knowledge that treatment is not conducive to recovery, in both men and women, or that tall plants tend to produce higher yields than short ones.

2.3 Backdoor criterion and Simpson’s paradox

As already remarked, Pearl offers a general solution to Simpson’s paradox, which applies in principle to an extensive range of scenarios. (Exceptions are cases in which crucial known variables are not measurable, for example. For these Pearl has developed separate techniques, such as the front-door criterion.) Pearl’s solution to the puzzle of Simpson’s paradox is dictated by his solution to the famous problem of confounding. In fact Pearl seems to think that Simpson’s paradox just is the problem of confounding in its simplest expression. This leads Pearl to declare that the search for a solution to Simpson’s paradox is over, because the problem of how to select a sufficient set for the control of confounding is resolved by a “purely graphical means”. I argue here that Pearl is wrong to think that an adequate solution to the problem of eliminating confounding is sufficient for resolving Simpson’s paradox. I don’t see how the question could ever be answered without invoking causal knowledge. But in my view, resolving Simpson’s paradox requires more than just identifying causal structure. It requires a substantive pragmatics of cause and effect, which I will sketch later on in the chapter, but which ultimately remains a question to pursue in more detail in future work.

Here is a quick overview of how my argument works. As noted, the back-door criterion is purely graph based. Solving Simpson’s paradox by applying it means that nothing more than the causal graph is required to know whether one ought to proceed in accordance with the partitioned data. But this way of approaching the problem implies a strong equivalence condition. It implies that if the back-door technique is applied to a pair of graphically isomorphic models, we should get the same answer regarding how to proceed. Either the criterion will predict that one must adjust for confounding in order to determine how to proceed correctly, or it will predict that there is no requirement to adjust to obtain the correct decision, in both graphs, or in neither. But I will describe two different interpretations of the same causal graph, in which intuition dictates that the correct decision in one situation tracks the unadjusted data (overall effect), and in the other intuition reveals that the correct answer lies in the partitioned data.

Moreover, as I will show, in the causal graph that is shared between the two interpretations there is no confounding, so the back-door criterion predicts in both that the correct answer how to proceed is given by the unadjusted data. If I am correct about the intuitions involved, it follows that the prediction on the basis of the back-door criterion fits only one of the two interpreted causal models.
The question we ought to ask ourselves is how to supplement the back-door criterion (or any generalization thereof that remains purely graph based) in order to get the correct answer how to proceed. What else does our causal reasoning keep track of when we determine how best to proceed in actual Simpson’s paradoxical scenarios?

Structurally identifying and resolving Simpson’s paradox

Pearl makes two claims on behalf of structural (graph-based) analysis. First, he argues that we can structurally identify the class of structures which are compatible with Simpson’s paradox. Second, that within the class of structures which are compatible with Simpson’s paradox, we can identify which structure requires a decision based on the aggregate data, and which requires a decision based on the disaggregate data, on the basis of the backdoor criterion.

According to Pearl, none of the structures in Figures 2.1, 2.2, or 2.3 is compatible with Simpson’s paradox, because in all three graphs, the association between X and Y is collapsible over Z.

What does it mean for an association between two variables to collapse over a third? Unfortunately Pearl does not define the crucial concept of collapsibility in Pearl [64], where this observation connecting Simpson’s paradox and collapsibility occurs. In the statistical literature collapsibility is sometimes understood as a property of measures of association. A measure of association describes the degree of association between variables. Many different ways exist in which we can measure the strength of the association between one variable and another. A measure of association is said to be collapsible when the marginal measure can be obtained as a weighted average of the conditional measures over some factor Z. Examples of a collapsible measure are the standard risk ratio and the risk difference. A notorious example of an inherently non-collapsible measure is the odds ratio. See Hernán et al. [27] and Greenland [25].
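The contrast between collapsible and non-collapsible measures can be illustrated numerically. In the following sketch the risks are hypothetical, and Z is stipulated to be independent of X so that there is no confounding: the marginal risk ratio lies between the stratum-specific risk ratios, while the marginal odds ratio falls outside the range of the stratum-specific odds ratios.

```python
# Illustrative risks pr(Y=1 | X, Z), with Z independent of X (no confounding);
# the numbers are hypothetical, chosen to exhibit non-collapsibility of the
# odds ratio alongside collapsibility of the risk ratio.
risk = {(1, 1): 0.9, (0, 1): 0.8, (1, 0): 0.5, (0, 0): 0.3}
p_z1 = 0.5  # pr(Z=1), the same under X=1 and X=0

def marginal_risk(x):
    return risk[(x, 1)] * p_z1 + risk[(x, 0)] * (1 - p_z1)

def odds(p):
    return p / (1 - p)

rr = {z: risk[(1, z)] / risk[(0, z)] for z in (0, 1)}               # stratum risk ratios
or_ = {z: odds(risk[(1, z)]) / odds(risk[(0, z)]) for z in (0, 1)}  # stratum odds ratios

marginal_rr = marginal_risk(1) / marginal_risk(0)
marginal_or = odds(marginal_risk(1)) / odds(marginal_risk(0))

# The risk ratio collapses: the marginal value lies between the stratum values.
assert min(rr.values()) <= marginal_rr <= max(rr.values())
# The odds ratio does not: here the marginal OR falls below *both* stratum ORs.
assert marginal_or < min(or_.values())
```

With these numbers the stratum odds ratios are roughly 2.25 and 2.33, while the marginal odds ratio is roughly 1.91, even though no confounding is present.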
However, we do not need to consider the odds ratio here, because Pearl’s discussion of association reversal seems to concern the risk ratio, rather than the odds ratio. Indeed, as Hernán et al. [27] point out, the fact that the marginal odds ratio cannot be expressed as a weighted average of the conditional odds ratios is a mere mathematical oddity when the underlying causal model is not confounded, for example as in their fig 2(a). Be that as it may, what Pearl’s three graphs seem to have in common is that in each of them, Z is either probabilistically independent of X, or probabilistically independent of Y given X. (On page 183 of Pearl [62], Pearl gives precisely this definition of collapsibility, and terms it “the associational criterion of confounding”, which he ultimately rejects, albeit in a highly qualified fashion, as a criterion of confounding. The intricacies of the debate over confounding and collapsibility will not be pursued here.) Since this identifies the class of probabilistic functions which cannot be Simpson’s paradoxical with respect to X, Y, and Z, I will take the disjunctive condition to be what Pearl means by collapsibility in this context.

Figure 2.1: A collapsible association incompatible with Simpson’s paradox.

Figure 2.2: A second collapsible association incompatible with Simpson’s paradox.

Figure 2.3: A third collapsible association incompatible with Simpson’s paradox.

It is easy to see that if Z is probabilistically independent of X, or probabilistically independent of Y given X, then there can be no reversal of association between X and Y conditional on Z. If Z ⊥ X, then since pr(Y ∣ X) can be expanded as a weighted average of pr(Y ∣ X, Z) with pr(Z ∣ X) as weights, the independence condition rules out reversing the association between X and Y.
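The first half of this argument can also be checked by brute force. The following sketch (an illustration, not a proof) samples random distributions satisfying Z ⊥ X and confirms that a comparison which uniformly favours X = 1 within each stratum is never reversed in the marginal:

```python
import random

# Brute-force check: if Z is independent of X, pr(Y | X) is a Z-mixture of
# pr(Y | X, Z) with weights pr(Z) that do not depend on X, so the
# X-comparison cannot reverse. Parameters are sampled at random.
random.seed(0)

def no_reversal_when_z_independent_of_x(trials=10_000):
    for _ in range(trials):
        w = random.random()                 # pr(Z=1), shared by X=1 and X=0
        p = {(x, z): random.random()        # pr(Y=1 | X=x, Z=z)
             for x in (0, 1) for z in (0, 1)}
        marg = {x: p[(x, 1)] * w + p[(x, 0)] * (1 - w) for x in (0, 1)}
        # If X raises the chance of Y in every stratum, it must do so marginally.
        if p[(1, 1)] >= p[(0, 1)] and p[(1, 0)] >= p[(0, 0)]:
            if marg[1] < marg[0]:
                return False                # a reversal would refute the claim
    return True

assert no_reversal_when_z_independent_of_x()
```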
On the other hand, if Z ⊥ Y ∣ X, then by the fact that conditional independence is symmetrical we obtain Y ⊥ Z ∣ X; but if pr(Y ∣ X, Z) = pr(Y ∣ X), then reversal of association is once again impossible. For the (incomplete) axiomatization of conditional independence in terms of the so-called graphoid properties, which include the property of symmetry, see Galles and Pearl [18] and Galles and Pearl [19].

So let us assume that we have a collection of structures (directed acyclic graphs), all of which are consistent with Simpson’s paradox. Pearl’s second point is that to know whether to use the aggregate or the disaggregate (unconditional or conditional) probabilities in making a decision how to proceed we need to deploy the backdoor criterion.

Consider the graphs shown in Figures 2.4-2.7.

Figure 2.4: Common cause.

Figure 2.5: Indirect effect.

Figure 2.6: Collider.

Figure 2.7: Common parent.

How do we determine how to proceed on the basis of an analysis of the structural properties of the graph? Pearl’s back-door criterion gives us the answer.

(Backdoor Criterion) A set of variables Z satisfies the backdoor criterion relative to an ordered pair of variables (Xi, Xj) in a given directed acyclic graph if:

(i) no node in Z is a descendant of Xi;

(ii) Z blocks every path between Xi and Xj that contains an arrow into Xi. (See Pearl [62], chapter 3.)

Pearl writes:

Armed with this criterion we can determine, for example, that in Figures 2.4 and 2.7, if we wish to correctly estimate the effect of X on Y, we need to condition on Z, thus blocking the back-door path X ← Z → Y. We can similarly determine that we should not condition on Z in Figures 2.5 and 2.6. The former because there are no backdoor paths requiring blockage, and the latter because the back-door path X ← ○ → Z ← ○ → Y is blocked when Z is not conditioned on.
The correct decisions follow from this determination; when conditioning on Z, the Z-specific data carries the correct information. (Italics mine. See Pearl [64], page 10. I have rewritten the numbers referring to figures to match the numbering in this chapter.)

Suppose that there is no causal confounding, but nevertheless there exists a backdoor path from X to Y containing a collider (Figure 2.6); then the back-door criterion tells us to proceed in accordance with the aggregate data. Suppose that there is causal confounding, but of the collider type; then the back-door criterion tells us to proceed in accordance with the aggregate data as well, because to condition on the collider node would activate the back-door path, which is closed as long as we do not condition on it, and thus have the inadvertent effect of generating bias. Finally, suppose that causal confounding is present, via an open backdoor path from X to Y going through a fork or a chain link; then the back-door criterion tells us to proceed in accordance with the disaggregated data by conditioning on the appropriate set of variables Z. Thus, specific answers how to proceed to specific problems in which this query is raised follow from the close investigation of causal structure in a particular way: namely, in accordance with the dictates of the back-door criterion (if applicable), which is designed to eliminate possible bias arising from the presence of spurious causal links.

Assessing the argument

To assess the claim that the determining conditions for how to proceed rest with the back-door criterion, suppose we zoom in on just a single set of predictions, concerning the graph in Figure 2.5. The back-door criterion predicts that we should not condition on Z, because the directed path X → Z → Y represents a genuine causal effect of X on the variable Y, mediated by the intervening node Z.
As the first italicized sentence in the quote from Pearl shows, the reason we are enjoined not to condition on Z is, essentially, that there are no open back-door paths from X to Y for us to block. Consider however the graph under two different interpretations:

(First interpretation: medical treatment) Assume that a newly developed medical drug (X) acts on recovery (Y) in two independent ways. First, the drug acts to significantly reduce blood pressure (Z) within normal bounds, where the blood pressure factor is known to contribute its own independent causal effect in the aetiology of the disease. Patients with high blood pressure tend not to recover, whereas those with low blood pressure tend to fare a lot better, ceteris paribus. We may assume that the knowledge concerning the independent effect of blood pressure on recovery comes from prior independent studies, or that it is implied by an appropriate physiological model. In addition to the effect mediated by Z, suppose that the drug also contributes a marginal negative causal impact on recovery, preventing recovery in some further insidious manner, which is apparent in the rates of recovery in the cohort of patients separated by blood pressure group. Nevertheless, insofar as the direct effect on blood pressure, extended by the effect that blood pressure has on recovery, outstrips in magnitude the direct negative effect on recovery, the total predominant causal effect of the drug on recovery may well be positive. The scenario described is Simpson’s paradoxical: (1) pr(Y = 1 ∣ X = 1) > pr(Y = 1 ∣ X = 0); yet (2) pr(Y = 1 ∣ X = 1, Z = 1) < pr(Y = 1 ∣ X = 0, Z = 1) and (3) pr(Y = 1 ∣ X = 1, Z = 0) < pr(Y = 1 ∣ X = 0, Z = 0) are both true. [nb. “1” represents the event and “0” the negation]

(Second interpretation: thrombosis) Suppose that the contraceptive pill (X) is suspected of causing thrombosis (Y) in people (women) to whom it is administered.
Nevertheless, it is believed that the contraceptive pill acts strongly to prevent pregnancy (Z), which is a major precipitating factor in the incidence of thrombosis in the population at large. Suppose that the negative indirect impact on thrombosis mediated by the reduction in risk of pregnancy is of a greater magnitude than the direct positive effect of the pill on thrombosis. Then the data generated by this process is Simpson’s paradoxical: (1) pr(Y = 0 ∣ X = 1) > pr(Y = 0 ∣ X = 0); yet (2) pr(Y = 0 ∣ X = 1, Z = 1) < pr(Y = 0 ∣ X = 0, Z = 1) and (3) pr(Y = 0 ∣ X = 1, Z = 0) < pr(Y = 0 ∣ X = 0, Z = 0) are both true.

Suppose that the practical goals of the decision-maker are (i) in the treatment case, to promote recovery, and (ii) in the thrombosis case, to promote health by preventing the incidence of thrombosis. Taking these goals into consideration, the back-door criterion predicts that the decision-maker should administer the treatment in the first scenario and administer the contraceptive pill in the second scenario. But our intuitions dictate that this is the correct decision in the medical case, and the incorrect decision in the thrombosis case. As I shall argue, in the latter, our intuition dictates that the rational decision should be in line with the direct effect of X on Y, and in the former, with the total effect, including the indirect effect mediated by Z.

The example involving thrombosis was originally proposed by Hesslow [29] as a test case for the theory of probabilistic causation. It was argued by Otte [60] that Hesslow’s example is a form of Simpson’s paradox that Cartwright’s theory of probabilistic causality cannot handle, insofar as it prohibits conditioning on pregnancy qua node descending from X on the causal path to Y.
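Both interpretations instantiate the graph of Figure 2.5 (X → Z → Y plus a direct X → Y arrow), and both can be given concrete numbers. The following sketch uses invented probabilities (not data from Hesslow, Otte, or elsewhere in this thesis) to realize the treatment version; rereading X as the pill, Z as avoiding pregnancy, and Y as avoiding thrombosis gives the second version with the same arithmetic.

```python
# Hypothetical chain-plus-direct-effect model for Figure 2.5; all numbers
# are invented for illustration.
p_z_given_x = {1: 0.9, 0: 0.2}   # pr(Z=1 | X): X strongly promotes Z
p_y_given_xz = {                 # pr(Y=1 | X, Z): Z helps Y; X's direct effect hurts
    (1, 1): 0.7, (0, 1): 0.8,
    (1, 0): 0.2, (0, 0): 0.3,
}

def pr_y_given_x(x):
    """pr(Y=1 | X=x); with no backdoor path this equals pr(Y=1 | do(X=x))."""
    pz = p_z_given_x[x]
    return p_y_given_xz[(x, 1)] * pz + p_y_given_xz[(x, 0)] * (1 - pz)

# (1) Overall (and under intervention), X promotes Y: 0.65 > 0.40 ...
assert pr_y_given_x(1) > pr_y_given_x(0)
# (2), (3) ... yet within each stratum of Z, X works against Y:
assert p_y_given_xz[(1, 1)] < p_y_given_xz[(0, 1)]
assert p_y_given_xz[(1, 0)] < p_y_given_xz[(0, 0)]
```

Since the only paths from X to Y here are directed, conditional and interventional probabilities coincide, so the reversal is genuinely causal; the disagreement between the two interpretations is then over whether the total effect or the stratum-specific direct comparison should guide the decision.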
Eells [13] defended Cartwright's analysis of causation against Otte's objection, and argued that Otte's own proposed revisions of Cartwright's theory fail to secure an adequate response in the thrombosis case.

These well-known difficulties in the literature on probabilistic causation are inspired by a single central intuition, which is that in the thrombosis case, the relevant causal effect that our decisions and evaluations of causative impact ought to track is the direct impact of the contraceptive pill on thrombosis. But the back-door criterion also prohibits conditioning on Z (pregnancy). It too runs foul of the central intuition about direct effect. Yet the correct reasoning about these cases goes as follows:

• Grant that treatment increases the chances of recovery by lowering blood pressure, a major contributing risk of death. Then we should administer treatment, as the sole best way to proceed, *despite* the fact that if we hold blood pressure group constant, treatment may marginally increase the risk of death, through some further independent effect, which is outstripped along the major causal path.

• Grant that the pill lowers the overall risk of thrombosis, by lowering the chance of pregnancy, a major contributing risk of thrombosis. And yet we should not administer the pill, *because* if pregnancy is held constant, the pill increases the risk of thrombosis, showing that the independent causal effect, albeit outstripped along the major causal path, is the causal effect that matters to the decision how to proceed.

How can we account for these intuitions, and can we coax them into a serviceable theory to restrict the application of the back-door criterion until it gives the right results in the right cases? If in some cases we do not go by the predictions of the back-door criterion, can we give a systematic account of why not?
This is a major task for causal theory, and I will merely attempt to sketch how the account might go, noting that the immediate lesson may be (ultimately) the most valuable: a simple yet effective argument as above shows that we cannot apply graphical analysis without first untangling more of the causal structure than the graph can actually capture. Here are some features of analysis that I think an adequate pragmatics of cause and effect should encompass:

Relevant alternatives. Determining how to proceed in Simpson's-paradoxical cases should require us to envision alternatives to the proposed course of action by substitution for the intermediate causal link. In the thrombosis case, it seems plausible to suppose that there may be easy alternatives to manipulating the risk of pregnancy by administering the contraceptive pill. Given the minor independent risk of thrombosis that the latter action carries, it would be better if the reduction in risk of pregnancy were effected by alternative means. In the medical case, we do not know if such alternative manipulation of the blood pressure factor exists, because this depends very strongly on the biological nature of the disease. Perhaps the treatment developed to target the disease is the only treatment of its type that can mediate recovery by affecting the determining factor of blood pressure. This makes the indirect causal path all the more valuable, and justifies taking the independent risk contributed by the treatment, if blood pressure is held fixed.

Priority. Determining how to proceed in Simpson's paradox requires us to assess the nature and stability of the causal links, in order to prioritize certain causal paths over others.
For example, the causal path linking the contraceptive pill to thrombosis directly ought to have priority over the causal path mediated by pregnancy, because the former represents an inherent and robust physiological relationship, whereas the latter is affected by broader factors not in the analysis, such as age, reproductive health, social class, and a wide range of factors outside the model. The ranking must pay heed to the fact that the strength of the influence along the indirect causal route can be affected by extrinsic factors. Thus, by making a decision to avoid pregnancy, e.g. by abstinence or other means, one can mediate the influence that the contraceptive pill has on pregnancy. At one extreme, in a population of totally abstinent individuals the contraceptive pill will have no effect on pregnancy, because in this context both taking the pill and not taking it will result in no pregnancy. The discernible effect of the pill on reproduction (if there is one) requires some level of exposure to the risk of pregnancy, both if one resorts to taking the pill and if one does not. Hence, the causal link between pill and pregnancy, without which the total effect of the pill on thrombosis would be positive, is not robust with respect to externalities.
By contrast, one may not easily mediate the direct causal influence between the pill and thrombosis, assuming that the pathway through which this operates is physiological and internal to the organism.

Sensitivity to change in parameters. Determining how to proceed in Simpson's paradox requires understanding how change may occur in the model if the relevant parameters are altered; in other words, to assess how robust our conclusions about how to proceed are to model manipulation. Consider a joint probability function pr(X, Y, Z) and suppose that it is factorized in accordance with the graph in Fig 1(a):

pr(X, Y, Z) = pr(X) pr(Z ∣ X) pr(Y ∣ X, Z)

Suppose we want to determine the total causal effect of do(X = 1) compared with the effect of do(X = 0) on the marginal probability distribution of Y. That is, suppose we want to know the total causal effect of one possible value of X rather than another. Since there is no confounding, the effect of intervening to force X to take a value is just the same as conditioning on the event that X takes that value:

pr(Y = 1 ∣ do(X = 1)) = pr(Y = 1 ∣ see(X = 1))
= pr(Y = 1 ∣ X = 1, Z = 0) pr(Z = 0 ∣ X = 1) + pr(Y = 1 ∣ X = 1, Z = 1) pr(Z = 1 ∣ X = 1), and

pr(Y = 1 ∣ do(X = 0)) = pr(Y = 1 ∣ see(X = 0))
= pr(Y = 1 ∣ X = 0, Z = 0) pr(Z = 0 ∣ X = 0) + pr(Y = 1 ∣ X = 0, Z = 1) pr(Z = 1 ∣ X = 0).

Suppose we set pr(Y = 1 ∣ X = 1, Z = 1) = pr(Y = 1 ∣ X = 0, Z = 0) = 0. For example, read "Y = 1" as "death", "X = 1" as "ingests acid" and "Z = 1" as "ingests alkali" in the famous acid and alkali example due to Cartwright [10] (see also Eells' discussion of this example in Eells [14], section 3.1). Then the total causal effect of setting X to have one value versus another simplifies to:

pr(Y = 1 ∣ do(X = 1)) = pr(Y = 1 ∣ X = 1, Z = 0) pr(Z = 0 ∣ X = 1),
pr(Y = 1 ∣ do(X = 0)) = pr(Y = 1 ∣ X = 0, Z = 1) pr(Z = 1 ∣ X = 0).

If the first two terms in these two equations are close together, e.g. if the probability of death given one has ingested either poison on its own is pretty much the same, then whether acid causes death or prevents death depends on how the probability of not finding alkali in time compares to the probability of ingesting alkali having not previously taken acid. But that seems to depend on the model in an intolerable way. Why should the causal effect of taking acid on death have anything to do with the circumstances which relate taking one type of substance to the other? By pushing the probability of one arbitrarily high compared to the other, we can tweak total causal effects as we please. But surely a causal relationship should be beyond such tweaking. All the more reason to be careful about which type of effect we want our decisions how to proceed to track. If we're not careful about that, we might end up with the intuitively wrong recommendation, e.g. to take acid in order to prevent death. What kind of recommendation we would want from the model depends on the complexities of the case. Indeed it depends on how easy it would be to effect physical change in the world before any other kind of model-dependent change. The best recommendation would be tripartite: keep the poisons securely locked up, but together, so that if you take one, the probability of taking the other is 1.
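The model-dependence complained of here can be made numerically vivid. In the following hedged sketch (the lethality value and both pr(Z ∣ X) settings are invented), with pr(Y = 1 ∣ X = 1, Z = 1) = pr(Y = 1 ∣ X = 0, Z = 0) = 0 and each poison alone equally deadly, the sign of the total effect of acid on death is fixed entirely by pr(Z ∣ X):

```python
# Y = 1: death; X = 1: ingests acid; Z = 1: ingests alkali.
# Acid plus alkali neutralize, and ingesting neither is safe, so only the
# "exactly one poison" terms survive in the do-calculus sums from the text.
LETHAL = 0.9  # invented: pr(Y=1 | exactly one poison ingested)

def total_effect(p_z1_given_x1, p_z1_given_x0):
    """pr(Y=1 | do(X=1)) - pr(Y=1 | do(X=0)) under the Fig 1(a) factorization."""
    p_do_1 = LETHAL * (1 - p_z1_given_x1)  # the (X=1, Z=0) term
    p_do_0 = LETHAL * p_z1_given_x0        # the (X=0, Z=1) term
    return p_do_1 - p_do_0

# Alkali rarely co-ingested: acid sharply raises the chance of death ...
assert total_effect(p_z1_given_x1=0.1, p_z1_given_x0=0.05) > 0
# ... but if taking acid makes taking alkali almost certain, acid "prevents" death.
assert total_effect(p_z1_given_x1=0.95, p_z1_given_x0=0.5) < 0
```

The second setting is, in effect, the locked-up-together policy: it drives pr(Z = 1 ∣ X = 1) towards 1 and flips the computed total effect, even though nothing about the physiology of acid has changed.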
But on any level the tripartite recommendation is obviously a good one, because acid and alkali each directly and independently cause death, regardless of how patterns of co-ingestion occur in the wild.

2.3.1 Decision theory and the problem of the Bad Partition

It is worth noting here that the problem affects not just the back-door criterion, but also a central claim of causal decision theory, which is that partitions of the logical space which are known to be causally dependent on the action space ought to be proscribed from the calculation of expected causal utilities.

It is well known that the original and immensely influential form of decision theory due to Savage (see Savage [72]) is susceptible to (apparent) counterexamples, which involve partitioning the logical space in such a way that the events over which the action space unfolds turn out to be probabilistically dependent on one's choice of action. Consider for example the famous deterrence problem (Jeffrey [34], pp. 8-9; an excellent discussion can be found in Joyce [36], pp. 115-117). A vandal threatens to break the windshield of your car unless you pay him a small amount of money (smaller than it would cost to replace it). You have every reason to think that the threat is credible. In this critical situation one could (unwisely) reason as follows: either the windshield will be smashed by the vandal (p1), or it won't (p2). In either case, one would rather have the small amount of money than not. Hence, the reasoning concludes that one ought not to pay the protection money. The fault in this reasoning lies in the obvious fact that the probabilities of p1 and p2 depend on your choice of action. It is virtually certain that if you don't pay, p1 will obtain, and if you do, it's fairly probable that p2 will obtain.

Savage himself responded to the problem by denying that the decision problem in the deterrence scenario is well-posed.
In order for the principles of decision theory to apply, he replied, the partition over which expected utilities are computed must be probabilistically independent of the action space. As was pointed out by Gibbard and Harper [22] (see also Joyce [36]), the notion of probabilistic independence is ambiguous between news-value evidential independence and causal independence. In the former, news-value sense, independence requires that the action space should not provide the decision-maker with evidence that one rather than another cell of the partition obtains. In the causal sense, independence requires that the action performed or contemplated should not bring about an increase in the probability of one or another cell of the partition.

Evidential decision theory and causal decision theory differ over which kind of independence is required to render a problem admissible. Evidentialists argue that both forms of independence must be satisfied; causal decision theorists require only causal independence (for an exposition of this point see Joyce [36], pp. 150-151). More specifically, according to causal decision theory, an admissible partition should satisfy the criterion of causal independence:

(admissibility) An admissible partition for the calculation of expected causal utilities must be causally independent of the action partition.

The restriction to causally independent partitions deals with the deterrence problem successfully. But it seems that the type of example that challenges the back-door criterion also challenges admissibility. For consider the thrombosis scenario, and consider the partition Q: q1: the subject is pregnant; q2: the subject is not pregnant. Let A be the action space comprising the actions a1: administer the contraceptive pill; a2: withhold access to the contraceptive pill. The background causal knowledge comprises the information that the contraceptive pill is causally relevant to the pregnancy outcome for the subject.
Therefore, we may assume that the partition Q is not causally independent of the action space A. But admissibility proscribes the use of causally dependent partitions in the calculation of expected utilities. Yet as we have seen, the sensible decision requires reasoning on the basis of the partition by the pregnancy factor.

In short, the problem for causal decision theory is that we need a principled way to distinguish not just between good and bad partitions, but also between obviously good and obviously bad applications of the restriction to causally independent partitions. Until such a way is found, causal decision theory may be no better off than the back-door criterion in this respect.

2.3.2 Objection from the nature of arguments based on causal isomorphisms

Before I conclude, I want to address an objection to the form of argument presented in this chapter. In the literature on actual causation there is a segment of opinion to the effect that arguments from isomorphism do not in general achieve their target, because arguments from isomorphism coarsen the nature of the causal problem in order to get the alleged identity of structure.

As a representative expression of the anti-isomorphism opinion, consider the arguments presented in Blanchard and Schaffer [7] (to appear). The key points (apart from their discussion of a few specific causal models by way of illustration, which I will not go into here) can be broken down into four main contentions. (1) The authors argue that one can get "structural isomorphism" on the cheap, by "ignoring causal structure". Such models by their very nature are therefore non-apt. An apt model is one which is capable of representing known causal structure adequately. (2) The authors argue that non-apt structural isomorphisms are behind some of the arguments that call for default information to supplement ordinary structural equation models so as to aptly capture actual causation.
(3) They seem to suggest that the mark of non-aptness is instability: when causal structure is added to make the non-apt models apt, by addressing the missing information explicitly, causal verdicts are overturned. (4) They also suggest that arguments that appeal to "structural isomorphisms" are inherently suspect. As a heuristic for an inquisitive mind on its guard against dubious implications they offer the following piece of wisdom: "When confronted with structurally isomorphic but causally distinct cases, suspect that at least one of the models is impoverished or otherwise non-apt" (page 25 in Blanchard and Schaffer [7], draft; according to the authors there are other reasons for non-aptness to consider, but non-aptness via missing information seems to be the one they favour).

None of these arguments, if granted, touches the main conclusions in this chapter. The point I am making is as follows. Regardless of the particular merits of one causal model or another, it has to be granted that, in general, our judgements about causation sometimes track total causal effects, and sometimes track direct causal effects. In some places in the literature, direct causal effects have been referred to as path-specific causation (see Hitchcock [31] for more detail). When judgements of causal relevance track direct effects, the deliverances of the back-door criterion (or formal alternatives which hit the same target) will conflict with intuitions about how to resolve Simpson's paradox pragmatically, in contexts which call for rational decision-making. The point applies to any other theory which is designed to deliver total causal effects, as opposed to direct or path-specific effects. It is therefore wrong to look to the concept of isomorphism as the source of our troubles. To reach the real source of our troubles, one must refocus the theory, and learn how and when to do so in a systematic way.

2.4 Conclusion

Has Simpson's paradox been solved? In my view, it remains a substantive paradox in several ways.

One way in which Simpson's paradox remains a paradox is that it permits instability of rational evaluation. In the treatment case introduced in the celebrated article by Lindley and Novick [49], the correct decision is not to treat, and one can anticipate that at t_future, when one's patient is revealed as being either female or male, one's retrospective evaluation will be in perfect agreement with one's current evaluation regarding the capital quality of being the correct way to proceed. But in the horticultural case, given that the correct decision is to select a white cultivar, one can anticipate that one's hindsight evaluation under conditions of better knowledge (when the plant's height characteristics become apparent) will be at variance with one's current views about the matter. However counterintuitive this predicament seems, the instability is chronic and ineliminable. (It is worth noting that the very idea of instability struck Lindley and Novick [49] as preposterous: "The apparent answer is, that when we know that the gender of the patient is male or when we know that it is female we do not treat, but if the gender is unknown we should use the treatment! Obviously that conclusion is ridiculous" (page 45, Lindley and Novick [49]). As it transpires, one is able to get rid of the ridiculous conclusion for the treatment case, but not for the planting.)

The second main way in which Simpson's paradox does not have a solution is that it draws on expert causal knowledge in ways which are unpredictable on the basis of Pearl's back-door criterion. The back-door criterion is the simplest answer to date to the question of how to control for confounding bias. In-house generalizations of the back-door criterion exist, for example the front-door criterion (see Pearl [62] for more detail; an alternative method is developed and presented in Lauritzen [43]), an extended technique for the elimination of latent variables whose suspected spurious influence may not be measurable by practical means. But although these methods represent immensely valuable techniques for understanding and finding practical solutions to many theoretically complex problems, we must still be prepared to address philosophical questions at ground level. On the view defended here, it is correct to argue as Pearl does that the appropriate approach to Simpson's paradox is through causal reasoning. I think it is fair to say that we cannot even begin to answer questions about how to proceed without invoking substantive causal knowledge which identifies potential confounders and gives us as full a picture of the underlying causal structure as possible. But I argued that it is mistaken to think that an inflexible application of a technique such as the back-door criterion can yield the sophisticated and subtle evaluations that inform, case by case, our best decisions how to proceed. The relevance and challenge of Simpson's paradox lies precisely in the fact that it forces us to refine and reexamine our assumptions about cause and effect, and how the data speaks to them.

Chapter 3

Two Notions of Evidence

3.1 Introduction

Notoriously, Simpson's paradox poses considerable problems for reasoning about cause and effect on the basis of relations of probabilistic relevance. In the literature on probabilistic causality, Simpson's paradox was addressed by Cartwright [10] and Skyrms [74] in relation to the simple probabilistic analysis of causation, which assesses the statement that C causes E in terms of C raising the probability of E.
Under pressure from counterexamples arising from Simpson's paradox, the simple analysis was modified by Cartwright [10] to: C causes E if and only if C raises the probability of E relative to every way of holding fixed a set of appropriate factors, namely those which are (a) causally relevant to E and (b) not causally influenced by C, i.e. all causally relevant factors except for C and the effects of C. In the variation due to Skyrms [74], the condition for holding fixed an appropriate set of causal factors was replaced with a condition resembling "Pareto-dominance". This says that C causes E if and only if C raises the probability of E relative to at least one way of holding fixed the causally relevant factors, and lowers it in none (for interpretation and criticism of these proposals see Eells [13], Hardcastle [26], Otte [61]). In view of the nature of the modifications imposed by Simpson's paradox on the probabilistic construal of causation, its proponents could no longer claim to perform a reductive analysis of causation, since causal terms appear explicitly in the specification of the factors which we are required to hold fixed. In the view of Pearl [62], the program of probabilistic causality is essentially dead in the water, due in part to the difficulty of resolving these issues.

In this chapter I will argue that Simpson's paradox also poses a considerable problem for the popular Bayesian theory of evidence. The problem is that Bayesian epistemology employs a single concept of qualitative confirmation, whereas I will argue that Simpson's paradox forces the notion of evidence to split into a "causal" sense of evidence and a "news value" sense of evidence. The distinction is underwritten by the fact that the causal representation of the force of evidence requires us to take into consideration causal knowledge of the structural relations in the "real world" which generate the SP data.
But the news value sense ignores causal knowledge, and goes by overall probabilistic relevance.

According to the familiar Bayesian conception of incremental evidence, e is evidence for h if learning e increases the probability of h. It is a fundamental tenet of Bayesian epistemology that learning new evidence must be accommodated by conditioning on the evidence, relative to a probability function which expresses one's total knowledge. This condition is known in the literature as the total evidence requirement. But the problem is that the total evidence requirement doesn't seem to be able to accommodate causal knowledge. It is well known that causal knowledge cannot be encoded purely probabilistically by a probability measure, since probabilistic information underdetermines causal structure. One cannot infer a unique causal structure (a description of processes in the world generating the data) from a probability function, since a single probability function may be compatible with many causally inequivalent process representations. The converse aspect of this problem is that causation cannot be encoded by means of conditional probabilities alone. The total evidence requirement simply does not seem to address causal knowledge in any shape or form.

It is not clear how the Bayesian theory of evidence can respond to this problem. In this chapter, I will suggest that we need to acknowledge different ways of thinking about the force of evidence, which I will characterize as arising when probabilities are considered under an "intervention" or under an act of "conditioning". Whether a Bayesian theory can accommodate the distinction between reasoning about the force of evidence in these two different ways remains to be seen. The aim of this chapter is to show that the Bayesian theory needs to be modified to cope with Simpson's paradox. First, I will present a selective overview of the implications that Simpson's paradox poses for reasoning about causation.
Then I will explain how the problem reverberates for the Bayesian analysis of evidence. In the rest of the chapter I will flesh out the two senses of evidence I claim are needed to account for evidential reasoning in Simpson's paradox.

3.2 Causation and evidence

3.2.1 Causation and Simpson's paradox

Consider a classic Simpson's-paradoxical scenario due to Skyrms ([74], page 106). Suppose that a recent observational study into the effects of smoking on public health has identified an overall negative association between smoking and pulmonary cancer (good news, apparently, for the smokers). Yet the study has also identified a positive association between smoking and cancer if the data is stratified by geographical location (city or country). The data (we may further suppose) arose through a complex interaction of environmental factors and human psychology. Pollution is higher in cities than it is in the countryside. Knowing this fact, individuals who live in cities tend to refrain from smoking, for fear of putting their lungs into "double jeopardy". The converse of this phenomenon is that country-folk tend to take up smoking in droves, exhibiting a remarkable degree of risk-courting hubris. They feel themselves just safe enough not to be really safe. Thus through a quirk of psychology smoking is in fact positively relevant to living in the countryside. But pollution is a much bigger health hazard than smoking; so the relative protection afforded by clean air to one group, coupled with the relative exposure experienced by the other group, means that overall, smoking is negatively relevant to disease.
But the fundamental problem is that causal structure (the depiction of processes in the real world which generate the data) cannot be encoded as a probability function, because causation is underdetermined by conditional probabilities. This means several things. First, it means that any given probability function does not in general determine a unique causal structure, but may determine a class of causally inequivalent depictions of causal relations. (Such structures are sometimes called Markov equivalent. For example, a probability function which is defined by the condition that X is independent of Y given Z can equally represent any of the following causally distinct but Markov-equivalent graphs: (i) X ← Z → Y, (ii) X → Z → Y, (iii) Y → Z → X. Another example is given in the text below.) To find out which among the probabilistically equivalent, causally inequivalent models is the correct one relative to a given problem is a substantive task for causal analysis. Second, it means that causal reasoning cannot proceed unsupplemented on the basis of conditional probabilities alone. We cannot infer causal structure from the purely probabilistic data. Neither causal relations nor causal effects may be ascertained or quantified simply on the basis of relations of probabilistic dependence and independence.

To exemplify the first point, note that either graph 3.1 or 3.2 can aptly represent the probabilistic information in Skyrms' story. As far as the joint probability distribution is concerned, the two graphs are equivalent. In common sense terms: observing either process in the real world would not make a difference to the study. But only graph 3.1 is an adequate representation of the actual causal process that is supposed to have generated the Simpson's-paradoxical data. Letting the random variables X, Y, Z represent smoking, cancer, pollution, the causal assumptions expressed in the smoking scenario licence the graph in which Z is a direct causal influence on X and Y, whilst X has a direct causal influence on Y. But this is equivalent to observing a different causal process in which X is a causal influence on Z, for example, when people move around as a result of smoking habits, so that we might see, e.g., an exodus of smokers to the country, and an influx of non-smokers to the city.

Figure 3.1: Guilty pleasures. [Z → X, Z → Y, X → Y]

Figure 3.2: An exodus of smokers. [as Figure 3.1, but with the arrow between Z and X reversed: X → Z]

As regards causal inference, note that inferring, from the fact that smoking is negatively relevant to cancer, that smoking has a negative causal effect on cancer is palpably a mistake. Although there is an overall negative correlation between smoking and cancer, it is largely spurious, due to the existence of a common cause, pollution. As the example illustrates, positive causal relevance is perfectly compatible with overall negative probabilistic relevance. Furthermore, positive causal relevance is also compatible with negative probabilistic relevance conditional on the partition by Z, as can be demonstrated by means of similarly constructed examples. In short, neither the inference from overall association nor from partition-relative association to causal effect can claim to identify or support its conclusion adequately.

3.2.2 Evidence and Simpson's paradox

As the previous discussion showed, causal knowledge is not reducible to knowledge of probabilities. In the Bayesian framework a rational agent's credal state (his rational opinion) is usually represented in terms of a probability function, or a set of probability functions the members of which are supposed to reflect the agent's knowledge. But if we attempt to model what the agent knows in a Simpson's paradox by a probability function, we cannot claim to have represented what he knows about the causal nature of the processes which generated the data.
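Both halves of this predicament can be made concrete in a small numerical sketch (all probabilities invented): a joint distribution built according to the graph-3.1 factorization also factorizes exactly according to graph 3.2, so the data cannot favour one causal story over the other; and the same numbers reproduce Skyrms' pattern of stratum-wise positive, overall negative relevance.

```python
from itertools import product

# X = smoking, Z = living in the city, Y = cancer. Build a joint pr(X, Z, Y)
# from the graph-3.1 factorization pr(Z) pr(X | Z) pr(Y | X, Z).
p_z = {1: 0.5, 0: 0.5}
p_x1_given_z = {1: 0.2, 0: 0.7}               # city dwellers refrain from smoking
p_y1_given_xz = {(0, 1): 0.30, (1, 1): 0.35,  # pollution (Z=1) dominates the risk;
                 (0, 0): 0.05, (1, 0): 0.10}  # smoking raises it within each stratum

joint = {}
for x, z, y in product([0, 1], repeat=3):
    px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
    py = p_y1_given_xz[(x, z)] if y == 1 else 1 - p_y1_given_xz[(x, z)]
    joint[(x, z, y)] = p_z[z] * px * py

def pr(pred):
    """Probability of the event picked out by pred(x, z, y)."""
    return sum(p for (x, z, y), p in joint.items() if pred(x, z, y))

# The same joint also factorizes according to graph 3.2, pr(X) pr(Z|X) pr(Y|X,Z);
# the chain rule guarantees this, so the data cannot discriminate the two graphs.
for x0, z0, y0 in product([0, 1], repeat=3):
    px = pr(lambda x, z, y: x == x0)
    pz_x = pr(lambda x, z, y: x == x0 and z == z0) / px
    py_xz = pr(lambda x, z, y: x == x0 and z == z0 and y == y0) / (px * pz_x)
    assert abs(joint[(x0, z0, y0)] - px * pz_x * py_xz) < 1e-12

# The invented numbers also reproduce Skyrms' pattern: smoking is positively
# relevant to cancer within each location, yet negatively relevant overall.
def p_y1_given_x(x0):
    return pr(lambda x, z, y: x == x0 and y == 1) / pr(lambda x, z, y: x == x0)

assert p_y1_given_xz[(1, 1)] > p_y1_given_xz[(0, 1)]
assert p_y1_given_xz[(1, 0)] > p_y1_given_xz[(0, 0)]
assert p_y1_given_x(1) < p_y1_given_x(0)
```

Nothing in the joint distribution itself records which factorization describes the real process; that is precisely the information the two agents discussed next disagree about.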
Compare two rational agents whoshare the observational data above, one who believes that smoking causallyinfluences location, and one who believes that location influences smoking.The first believes that graph 3.2 correctly sums up the causal facts, the sec-ond believes it is graph 3.1. Their disagreement concerns the causal facts ofthe matter, and not the data, which is common ground between them. Buttheir disagreement will not show up in their credences or rational degrees ofbelief, understood in the usual way as conditional probabilities. To be sure,both agents will aptly be represented as believing that e.g. getting canceris less likely if one smokes than if one doesn’t, or that cancer is more likelyif one smokes than if one doesn’t in cities as well as in the countryside, etc.As far as that goes, a probability function will do the job, but representingknowledge or disagreement about causal fact is bound to elude the task atthis level. We won’t find such knowledge encoded in the form of a probabil-ity function. It follows from this that the knowledge attributed to an agent61whose opinion is encoded purely probabilistically is incomplete.If the knowledge that the simple Bayesian framework can represent is inan important sense incomplete, then it cannot satisfy its own central require-ment of total evidence. The total evidence requirement was originally pro-posed by Carnap as a criterion for the applicability of inductive inference42.The principle was intended to secure a solution to an immediate problemarising from an “uninhibited use of statistical syllogism” 43 yielding mutu-ally inconsistent conclusions. Appealing to the total evidence requirementdeemed the inferences in question illegitimate, because less than the totalamount of evidence had been used in making the inferences. 
Since Carnap the principle has been internalized by Bayesian epistemology, which (having found it both congenial and compelling) both takes it for granted, and seems happy to leave it vague. It has been treated both as an epistemic norm (a norm of epistemic rationality, governing the justification of rational belief) and as a norm of practical rationality (a norm referring to the conditions for rational action). For example in McLaughlin [53] it is viewed as an epistemic norm. It states that it would not be rational to choose just any evidential base to determine the rational credibility of a hypothesis. Satisfying the core requirement of total evidence requires that the agent ought to gather the optimal amount of evidence, which is not necessarily all the evidence, but the amount that maximizes epistemic utility. In Good [23] it appears less as a norm governing rational credibility than as a principle of practical rationality imposed by the requirement to maximize expected utility. To answer Ayer's objection that on the Carnapian theory of logical probability it is not clear why we should seek further new evidence, Good replied that "it pays to take into account further evidence, provided that the cost of collecting and using this evidence, although positive, can be ignored". Once again in Kelly [41] it figures as an obvious norm of rationality. It states in this guise that to the extent that what is reasonable to believe is founded on one's evidence, what has evidential bearing is one's total evidence. This is because justification which depends on a proper subset of one's total evidence is susceptible to being defeated by yet further evidence44.

42 Carnap [9].
43 cf. Suppes [77], page 1. As an illustration of uninhibited statistical syllogism, Suppes considers the following. (I) The probability that Jones will live another fifteen years given that he is now between fifty and sixty years of age is r. Jones is between fifty and sixty years of age. Therefore, the probability that Jones will live another fifteen years is r. (II) The probability that Jones will live another fifteen years given that he is now between fifty-five and sixty-five years of age is s. Jones is between fifty-five and sixty-five years of age. Therefore, the probability that Jones will live another fifteen years is s. If s ≠ r, then it appears that the two inferences are athwart one another. On the total evidence resolution of the problem, both (I) and (II) are illegitimate, since (as it appears from the premises) Jones is specifically between fifty-five and sixty years of age, which is far more exact knowledge than is used in either (I) or (II).

How important is the total evidence requirement to the Bayesian framework? According to a modern perspective on the total evidence requirement (TER), the principle is needed to guarantee the coherence of epistemic Bayesianism, the view that evidential relations are best construed probabilistically45. Joyce [37] has argued that "much of what Bayesians say about learning and confirmation only makes sense if probabilities in credal states reflect states of total evidence"46. Specifically, this refers to the claim that learning ought to be construed as Bayesian updating, and the claim that questions about confirmation can be answered on the basis of structural features of credal states (i.e. of the probability functions which represent them). Joyce has further distinguished between three dimensions of evidential force captured in a precise way by the probabilistic framework. The balance of evidence is the magnitude of the probability of the hypothesis in light of the evidence; it is captured according to Joyce by the individual probability value accorded to h.
The weight of evidence is the concept introduced by Keynes to refer to the amount of evidence for a hypothesis, bearing in mind that weightier evidence need not entail a greater balance, since the accrual of data may disconfirm as well as confirm h. Joyce follows Skyrms [74] in arguing that weight is evinced by the stability of the probability of h, i.e. its resilience in the face of future evidence. Finally, specificity of evidence is the degree to which e discriminates among competing or alternative hypotheses, and is reflected by the spread of probability value over the entire credal state. Yet as noted, every dimension in which the force of evidence is unpacked requires us to abide by the requirement of total evidence; so if Bayesian theory can't satisfy its own requirement, it's unclear how much of the defence of these claims remains standing.

44 Kelly [41], pages 938-939.
45 See Joyce [37].
46 Joyce [37], page 154.

As I noted in the introduction, I do not know if epistemic Bayesianism manages to escape this problem. But I do think that it does face a problem, and it needs to address it. In the rest of this section, I will explain in more detail what the problem is, using our original handy illustration borrowed from Skyrms [74].

Consider one of the simplest evidential questions that one might ask: is e evidence for or against h? Imagine that we want to know whether the fact that John smokes is evidence for or against the proposition that John has cancer (we do not know where John lives). This is precisely the type of query that the simple Bayesian theory of qualitative incremental confirmation claims it is suitably equipped to deal with. On the Bayesian construal e is evidence for h just in case pr(h ∣ e) > pr(h). Probability theory allows us to express the incremental condition in terms of the equivalent "probative" condition pr(h ∣ e) > pr(h ∣ ¬e).
It follows apparently that from a simple Bayesian point of view John smokes disconfirms John has cancer. And yet to licence this conclusion, we have to show that the assessment respects the total evidence requirement; i.e. that the knowledge reflected in this assessment of evidential force isn't just a proper subset of what we know.

As a way of narrowing down the options, suppose we ask not whether all of our knowledge is included, but rather what our total knowledge ought to include. On the one hand, it should clearly include the Simpson's paradoxical data provided by the observational study. But it seems that our total knowledge ought to include facts beyond probabilities (if available), for instance the key facts that smoking causally promotes cancer, and that the negative correlation between smoking and cancer is spurious, i.e. that smoking does not causally inhibit cancer, but that the correlation is produced by a common causal factor, namely countryside location. This, to repeat, is not information that we can represent using a probability function (alone). It requires a representation of the causal structure of the problem, for example the familiar causal diagram of a directed acyclic graph.

But suppose for the moment that causal knowledge is not available. That is, suppose that one does not have enough information about the world to distinguish between the Markov-equivalent graphs 3.1 and 3.2. For all one knows, it could be that heavy smoking causes people to take refuge in the countryside (graph 3.2), instead of occurring as a consequence of living in the countryside. Then, if one does not possess sufficient causal information, it seems correct to say that the information that John smokes is evidence against cancer.
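The reversal pattern underlying this verdict can be made concrete with a small numerical model. The figures below are hypothetical illustrations of mine, not data from the text; they merely realize the structure of the Skyrms example (Z = location, X = smoking, Y = cancer), with the joint distribution factored along graph 3.1 as pr(z, x, y) = pr(z) pr(x ∣ z) pr(y ∣ z, x).

```python
from itertools import product

# Hypothetical numbers (mine, for illustration only).
p_country = 0.5
p_smoke = {"country": 0.8, "city": 0.2}                   # pr(X = smoke | z)
p_cancer = {("country", True): 0.15, ("country", False): 0.05,
            ("city", True): 0.45, ("city", False): 0.30}  # pr(Y = cancer | z, x)

def joint(z, x, y):
    """pr(Z = z, X = x, Y = y), factored along graph 3.1."""
    pz = p_country if z == "country" else 1 - p_country
    px = p_smoke[z] if x else 1 - p_smoke[z]
    py = p_cancer[(z, x)] if y else 1 - p_cancer[(z, x)]
    return pz * px * py

def pr(event, given=lambda z, x, y: True):
    """pr(event | given), computed by summing the joint distribution."""
    num = den = 0.0
    for z, x, y in product(["country", "city"], [True, False], [True, False]):
        w = joint(z, x, y)
        if given(z, x, y):
            den += w
            num += w if event(z, x, y) else 0.0
    return num / den

# Overall (with location unknown), smoking disconfirms cancer ...
p_c_smoke = pr(lambda z, x, y: y, lambda z, x, y: x)        # ~0.21
p_c_nosmoke = pr(lambda z, x, y: y, lambda z, x, y: not x)  # ~0.25
assert p_c_smoke < p_c_nosmoke
# ... yet within each location, smoking confirms cancer.
for loc in ("country", "city"):
    assert (pr(lambda z, x, y: y, lambda z, x, y: x and z == loc) >
            pr(lambda z, x, y: y, lambda z, x, y: (not x) and z == loc))
```

With these numbers, pr(h ∣ e) < pr(h ∣ ¬e) overall even though the inequality is reversed in both cells of the partition: exactly the situation described in the text.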
As remarked, this accords well enough with the Bayesian theory, since (given that one's total evidence does not include causal knowledge) the determining factor in the force of evidence is going to be the overall negative relevance of smoking to cancer, i.e. the fact that pr(Y = cancer ∣ X = smoke) < pr(Y = cancer ∣ X = ¬smoke).

To have a name for it, I will call this the news value sense of evidence. It represents the force of evidence, as I have explained, as assessed by us without taking the causal information into account, either because we don't have it, or because we have elected to ignore it. It is called "news value" (by analogy with the rift in decision theory between maximizing expected utility in the news value sense versus maximizing it in the causal sense) because it tells us about overall relevance, which goes by simple conditional probability. But there is another way in which the terminology is justified. The fact that John smokes literally gives us "news" in the sense of new information about where John lives, since smoking is probabilistically positively relevant to location. This news-item in turn updates our probability that John lives in the countryside (increases it), which then factors into an updated probability for John's cancer (decreases it)47. This local updating of the probability distribution over a causally spurious, interposed variable like "location" (Z) does not happen when evidence is assessed in the "causal sense", as I will explain below. (It is in fact the key difference between them.)

47 Of course, from the point of view of the probability function, this happens all at once. My argument merely breaks it down into stages. We can think of information as flowing through the system, achieving various effects as it does so.

Now suppose that our total evidence includes specific causal knowledge, so that our assessment of the force of evidence is carried out against a background corpus which includes the information represented in the causal graph 3.1, by means of which we effectively embed the observed data in the real world. Note that if we keep fixed where John lives, whether it's in the countryside or the city, we get a confirmatory, supporting effect from our evidence, since both conditional on city and conditional on country, the probability that John has cancer is increased by learning that John smokes. So it would seem that from a point of view which includes causal knowledge, our evidence is not evidence against, but rather evidence for the hypothesis of interest. I will call this evidence in a causal sense or causal evidence. It seems to fit the intuitive reasoning that if we knew hypothetically that John lived in the city (countryside), we would not judge the information that he smokes to be evidence against, but evidence for cancer. Thus, in the news value sense, the proposition that John smokes is evidence against the hypothesis that he has cancer, but in the causal sense, it is evidence for it.

Consider a few more Simpson's paradoxical examples, to clarify the distinction. (1) Studying arithmetic (we may suppose) increases the probability of passing the exam, but only if we keep the student's age constant. Otherwise, studying arithmetic is indicative of being very young, hence indicative of comparatively poorer scores. (2) Undergoing chemotherapy raises the expectation that the patient will live to see another solstice, but only if we keep the disease fixed; otherwise, undergoing chemotherapy indicates that the patient is seriously ill, and hence won't fare so well in the near future.
In one way, we want to say that arithmetic is evidence for passing the exam, by taking into account all available causal knowledge, in particular the information that studying promotes successful academic endeavours. This evaluation of evidential force represents an estimate of causal evidential effect, which traces the direct causal effect of the behaviour variable on the outcome variable. But we may also want to account for the evidential force of studying arithmetic or undergoing chemotherapy in terms of being evidence for an independent causal factor (common cause), namely for age in (1) and for disease in (2). If we combine the evidence that e.g. studying arithmetic provides for being young, with the further evidence that being young provides against passing with flying colours, we get overall evidential relevance: the covering of all relations of probabilistic relevance, including those deemed causally spurious.

What emerges from these examples is that whether we want to say that the evidence supports or does not support the hypothesis depends on what we keep fixed in assessing its evidential relevance. When we keep fixed the value of a variable, we do not use the information provided by evidence to update the probability distribution over it. Keeping fixed means breaking the probabilistic dependence between, for example, smoking and location. The point can be helpfully made using graphical intuition. Take again graph 3.1. Suppose we ask: what in the real world explains the fact that the overall probabilistic relevance of X on Y is negative? The key causal facts "in the world" are that smoking is an exacerbating risk for cancer, and low pollution (living in the countryside) is an inhibiting factor. How could these facts be made compatible with the observed data? How could we reconcile the patterns in the world with the patterns in the data?
Intuitively it has to be the case that smoking and low pollution are probabilistically positively related; i.e., that incurring the exacerbating factor makes it more likely to incur the inhibiting factor, and the latter overall outweighs the former in its effect on health. Otherwise the data could not be Simpson's paradoxical (in the way it is, negative relevance reversing to positive, relative to Z). But note that what could account for this feature of dependence among X and Z "in the real world" is (with equal convenience) either the fact that X causes Z, or the fact that Z causes X. As far as overall relevance is concerned, these two causal possibilities are equivalent. We could have used the Markov equivalent graph 3.2 and reached the same conclusion about the overall force of evidence. News value evidence goes happily along the causally spurious path X ← Z → Y, because it does not distinguish between it and the alternative path X → Z → Y.

But now suppose that we want to "keep fixed" location, as when we want to keep fixed John's actual location in order to ascertain what evidential relevance in a causal sense the fact that he smokes bears upon the hypothesis that he has cancer.

Figure 3.3: The irrelevance of smoking. [Graph with nodes X, Z, and Y, in which X → Y and Z → Y, and there is no edge between X and Z.]

When we keep Z fixed in an epistemic or evidential sense, we do not allow information about X to pass locally onto Z. Thus to keep fixed the location, we have to treat X and Z as though they were probabilistically independent of each other. In other words, as though X and Y were related by graph 3.3, instead of graph 3.1, and by learning about X we were in fact instantiating X's value in 3.3, not in 3.1. Of course, such a graph is literally impossible from the point of view of our existing Simpson's paradoxical probability function, since our data deems X and Z to be probabilistically dependent, whilst X and Z are independent in graph 3.3.48
To perform the operation of keeping Z fixed in relation to learning about X, we must perform a transformation of our existing probability measure. The resulting probability measure must be compatible with the idea of learning X when the process is as in graph 3.3.

48 In graph 3.3, Z and X are independent of each other, as can be seen from the fact that their sole connection is via the "collider" node Y. They will remain independent from each other, as long as Y is not conditioned upon.

3.3 Causal and news value degree of belief

The proposal to distinguish between causal and news value senses of evidence needs to be fleshed out. I argue in this section that we should think of evidence in the causal sense as a relation tracking causal probabilities, whereas evidence in the news value sense should be thought of as a relation tracking overall probabilistic relevance. To make sure the crucial notions of a causal probability and a news value probability have more than merely a technical meaning, it will be important to give them a clear epistemic meaning as well, as degree of belief in some appropriate sense. Therefore, informally, I will understand a causal degree of belief in Y given X to be the degree to which a rational agent ought to be confident that Y may be brought about as a consequence of bringing about X. Correspondingly, I will understand a news value degree of belief to be the degree to which information about X provides news or information about Y. Formally, I will understand a causal degree of belief to be the probabilistic measure of the agent's uncertainty in Y under an intervention on X, whereas the news value degree of belief represents the agent's uncertainty in Y probabilistically as well, but from the point of view of conditioning on X.

The idea that a probabilistic epistemology should implement a distinction between a causal and a non-causal sense of degree of belief does not seem to have very many proponents in the literature so far.
The closest antecedent source in the literature that refers explicitly to an epistemic dimension in the notion of a probability under an intervention seems to be found in an article by Menzies and Price [55]. In the course of attempting to characterize the agency theory of causation, the authors distinguish between ordinary conditional probabilities and what they refer to as agent probabilities. As it turns out, agent probabilities appear to represent an early prototype of probabilities under intervention, which I have labelled causal degrees of belief. As the authors write,

Agent probabilities are to be thought of as conditional probabilities assessed from the agent's perspective under the supposition that the antecedent condition is realized ab initio, as a free act of the agent concerned. Thus the agent probability that one should ascribe to B conditional on A is the probability that B would hold were one to choose to realize A49.

I discuss the notion of agent probability in the next section. But the main distinction that I seek to draw between rational degrees of confidence in view of manipulated variables, and rational degrees of confidence in view of conditioned upon variables, is best expressed in the framework of causal analysis developed by Pearl50.

In Pearl's work, the distinction between evidential and causal reasoning is encoded in the distinct operations of do and see conditioning, corresponding on an intuitive level to the distinction between observing an event, and acting to bring about an event. In probability calculus, the see operation is discharged by ordinary conditioning; thus the expression pr(Y ∣ see(X)) stands for the commonplace conditioning operation pr(Y ∣ X), referring to the probability of Y given X. The do operator captures probability change under an external intervention, which forces a variable to take a certain value; correspondingly, the expression pr(Y ∣ do(X)) is the probability of Y given that we intervene on X.

49 Menzies and Price [55], page 190.
50 See Pearl [62].

Accordingly, I will understand causal and news value degree of belief as follows:

(causal degree of belief) degree to which a rational agent is confident that Y takes value y, given that X is subjected to an intervention which forces it to take value x.

(news value degree of belief) degree to which a rational agent is confident that Y takes value y, given that X is observed to take value x.

Informal illustration 1 To illustrate informally the distinction between rational degrees of belief in view of manipulated variables, and rational degrees of belief in view of observed variables, consider the following scenario. Suppose that a rational agent is surveying a population in which foot binding is practiced to lessen the size of the feet of adult women51. It is reasonable to suppose that in the population at large, having small feet is positively correlated with having small hands. Thus observing that a woman has small feet is a reasonable basis for inferring that her hands are quite likely small as well. Yet binding a woman's feet is not a reasonable basis for drawing the same conclusion, because binding the feet has no effect on the size of the person's hands. We may therefore take it that the agent's probability that an individual woman has small hands given that there has been an "intervention" to control the size of her feet is just the probability that a randomly selected woman in the biological population of reference has small hands, i.e. the average probability for having small hands.

Informal illustration 2 Here is another illustration of the distinction between causal and news value degree of belief.
A rational agent has observed that there is a positive correlation between wearing galoshes on his feet, and having a headache. The reason is that the agent tends to forget to remove his galoshes after an epic night out, so the next morning he generally wakes up both fully shod and feeling a bit delicate. What should be his degree of belief in the proposition I have a headache in view of the proposition I am wearing my galoshes? There is a causal sense in which we may address this question, and a news value sense. If the agent peeps out from under the covers and notices that he is wearing his galoshes, his degree of belief in the proposition that he has a headache ought to be reasonably high. This is because an observation of galoshes on his feet gives him information about the common cause (epic drinking); a cause which also induces a headache the morning after the night before. On the other hand, if the truth of the proposition I am wearing my galoshes comes about as a result of the agent pulling his galoshes on, he should be no more confident that he has a headache than he was before he took note of this information.

51 The example is borrowed from Lindley [48].

Informal illustration 3 Here is a third illustration to make sense of the distinction. As far as Mrs Ramsey is concerned, receiving flowers is highly correlated with a happy marriage. After all, being surrounded by flowers makes her very happy and forgiving, which in turn oils the wheels of her marriage and makes everything run smoothly. But Mrs Ramsey also knows that flowers tend to be correlated with spousal guilt, insofar as the feeling that one has wronged one's spouse in a deeper than usual way prompts gestures of redress or atonement. Spousal guilt is of course a harbinger of a bad marriage. It makes people behave in less than ideal ways, which eventually brings everything down upon them. Tracing this pattern leads Mrs Ramsey to judge correspondingly that receiving flowers is highly correlated with a bad marriage.
What should be Mrs Ramsey's degree of confidence that her marriage is going well on the basis that she regularly receives flowers? Intuitively, it seems that whether she ought to be pessimistic or optimistic about her marriage depends on how it comes to be true that she regularly receives flowers. Suppose on the one hand that she regularly receives floral tributes because she has them sent to herself every week. Then it seems she ought to be fairly confident the marriage is going well, since whatever else is going on in her marriage, at least the flowers help her disposition to weather a crisis. But suppose that it is Mr Ramsey who has the flowers sent to her from his office via courier every week. Then Mrs Ramsey ought to be rather less optimistic, which ought to be reflected in her degree of belief. The flowers-to-self scenario shows what her causal degree of belief ought to be. The flowers-by-courier scenario shows what her news value degree of belief ought to be. In the former, Mrs Ramsey is a rational agent collecting the information that an active intervention in the causal system provides for her. In the latter, Mrs Ramsey is a rational agent collecting the information that a passive observation in the causal system provides for her.

Examples in which it is appropriate to distinguish between rational belief under an intervention and rational belief under an observation are easy to generate. And yet the point that arises from considering these illustrations more carefully is stronger than merely noticing that one can describe belief causally or in the news value sense. Rather, the stronger point ought to be that ignoring the distinction between belief under an intervention and belief under an observation leads us into bad error. Suppose that the agent who learns that he has galoshes on his feet this morning ignores the fact that this information was generated by the action of putting on his galoshes. Then he mistakenly infers that in all likelihood he has a headache.
Or suppose that the ethnologist who is exploring the phenomenon of foot binding neglects to check whether the examined feet are small in consequence of foot binding. Then he may well draw a mistaken inference about the size of the woman's hands, an inference moreover completely unwarranted if in fact the size is the result of a modification of the natural proportions of the feet.

A question hardly ever asked in epistemology is: when is it appropriate to incorporate a piece of information by conditioning, and when is it appropriate to treat it as an intervention in the system? Quite clearly, the fact that the information considered is true is not sufficient to determine an answer to this question. In all three illustrations there was a true piece of information such that when we imagined that the information was learned by observation, we were compelled to appeal to conditioning, and yet when we imagined that the information was generated by an action, it made sense to treat it as an intervention in the system. It follows that the way to deal with information depends on a parameter that shows how the information was learned: in consequence of an action, or passively. (a) We should update our probabilities by conditioning when we learn the content by gathering a passive observation. (b) We should update our probabilities by treating the information as an intervention when we generate it ourselves as a result of having acted in the system.

3.3.1 Comparison with Menzies and Price [55]

Menzies and Price [55] offer an analysis of causation in terms of manipulation or intervention, which is philosophically expressed in the idea of "free agency". According to their analysis, an event A is a cause of B just in case bringing about the occurrence of A constitutes an effective means by which a free agent could bring it about that B.
It is clear from their discussion that the notion of a free act to effect a change in B by means of manipulating A qua cause of B is a precursor of the formal notion of an intervention in a causal system, e.g. in the sense of structural equations modelling, or in the sense of an intervention in a causal graph. For example, as the earlier quote evidenced, they write that the perspective from which causation ought to be examined involves "the supposition that the antecedent condition is realized ab initio, as a free act of the agent concerned". In a similar vein, they write of a manipulation of one variable to bring about a change in another as "creating a new history for B, one that would ultimately originate in one's own decision, freely made".

So why should we not understand causal probabilities or causal degrees of belief in terms of the notion of agent probability as intended by Menzies and Price? If the point of invoking causal probabilities is to get rid of spurious correlations, Menzies and Price suggest, agent probabilities are fit for the task:

Agent probabilities fail to generate spurious correlations because they abstract away from the evidential impact of an event, in effect by creating for it an independent causal history. For example, in enquiring whether one's manipulation of an effect B would not affect the probability of its normal cause A, one imagines a new history for B, a history that would ultimately originate in its own decision, freely made52.

It would be nice if this theory of causation in terms of a primitive notion of free agency worked. But it does not.
On the one hand, I am happy to grant that the difference between a causal understanding of the probability of Y under an intervention on X and the news value probability of Y under conditioning on X is underwritten by the difference in truth value between the following pair of counterfactual statements, assuming X to be an antecedent cause of Y:

(C1) If X were intervened upon, the value of Y would change.

This counterfactual is true. More specifically, we could say that if X were forced to take value x, then Y would take value y with a probability determined by the magnitude of this effect; equivalently, that the post-intervention probability over the range of values for Y if X is forced to take value x is distributed in accordance with the quantity defined by pr(Y ∣ do(X = x)). Compare however the false and backtracking counterfactual:

(C2) If Y were intervened upon, then the value of X would change.

52 Menzies and Price [55], page 191.

It would seem that the idea that a free act guarantees us a "new history" gets this exactly right. Take the backtracking and false counterfactual (C2) first. If Y has a new history, then this new history becomes part of the similarity or closeness relation by which the semantics evaluates the conditional. But the new history cuts Y off from X. Hence in every closest possible world in which Y obtains the intended value, Y is cut off from X, and the value of X does not change. By similar reasoning (C1) is true. In every closest possible world in which X has its intended value, Y's value changes as well. It makes no difference to the causal dependence of Y on X that we guarantee a new history for X.

Thus, if the notion of agent probability canvassed by Menzies and Price has its motivation in capturing the difference in truth value between C1 and C2, that is the correct motivation to have. Nevertheless, I think the appeal to the notion of a free act is suggestive but misleading.
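The C1/C2 asymmetry can be checked numerically. The sketch below uses hypothetical figures of mine (not from the text) for graph 3.1 (Z → X, Z → Y, X → Y): intervening on a variable deletes that variable's own mechanism and leaves the remaining mechanisms intact, which is the standard "truncated factorization" treatment of the do operator.

```python
# Hypothetical numbers (mine): Z = location, X = smoking, Y = cancer.
p_z = {"country": 0.5, "city": 0.5}
p_x_given_z = {"country": 0.8, "city": 0.2}               # pr(X = smoke | z)
p_y_given_zx = {("country", True): 0.15, ("country", False): 0.05,
                ("city", True): 0.45, ("city", False): 0.30}

def pr_y_do_x(x):
    """pr(Y | do(X = x)): Z keeps its own, pre-intervention distribution."""
    return sum(p_z[z] * p_y_given_zx[(z, x)] for z in p_z)

def pr_y_see_x(x):
    """pr(Y | X = x): conditioning also updates the distribution over Z."""
    p_xz = {z: p_z[z] * (p_x_given_z[z] if x else 1 - p_x_given_z[z])
            for z in p_z}
    return sum(p_xz[z] * p_y_given_zx[(z, x)] for z in p_z) / sum(p_xz.values())

# (C1): intervening on X changes the probability of Y, in the causal
# direction -- even though conditioning points the opposite way.
assert pr_y_do_x(True) > pr_y_do_x(False)     # ~0.30 vs ~0.175
assert pr_y_see_x(True) < pr_y_see_x(False)   # ~0.21 vs ~0.25
# (C2): intervening on the effect Y gives it a "new history" and cuts it
# off from X, so the distribution over X is simply unchanged:
p_smoke_do_cancer = sum(p_z[z] * p_x_given_z[z] for z in p_z)  # pr(X), ~0.5
```

On these numbers the post-intervention probability of Y under do(X) does change with x, while pr(X ∣ do(Y)) collapses to the unmanipulated marginal pr(X): C1 comes out true and C2 false, as the text argues.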
In my view, there are two basic objections that preclude taking the notion of a "free act" as basic to the understanding of an intervention. (1) There appears to be no basis in any existing philosophical account of free agency that would preclude the following situation from arising:

(action-action spurious correlation) Suppose that a variable X has no causal bearing on a distinct variable Y, nor Y on X. Nevertheless, a free act modification of X is contingently correlated with a free act modification of Y, and therefore X and Y come out highly (yet spuriously!) correlated.

For example, if intervening by a "free act" to change a woman's foot size by foot binding produces artificially small feet, then one should expect to find no correlation between the act of binding the feet and the property of having small hands, assuming that foot binding is practiced randomly with respect to natural (inborn or inherited) size of foot. But if the free act of binding someone's feet is correlated in the population with the free act of binding the person's hands as well, for example for aesthetic or cultural reasons that dictate that small appendages are desirable both on the trunk and below it, then it will look as though intervening to lessen the size of the feet has a causal repercussion on the size of hands, and conversely. But no existing theory of a free act will tell us that we cannot freely execute acts A and B simply because A and B are carried out conjointly. Hence no understanding of an intervention as an ordinary free act will preclude this type of confounding case from arising.

(2) In my view, the account gets the connection between free agency and intervention exactly backwards. Rather than understanding intervention in terms of a free act, we should understand what it is that constitutes a free act in terms of what the free act enables us to do, namely intervene upon the world.
Using the notion of an intervention to explicate the notion of a free act is far more sensible than taking the rather opaque construct of a free act and attempting to cash out intervention in terms of it. For this reason, I suggest that the primary notion in the analysis ought to be that of an intervention, acknowledging that it is the aim of a free action to be as close to the ideal of an intervention as possible. That is, the more a free act upon the world manages to emulate the characteristic notion of an intervention, the more suitable the description of it as "free" ought to be. Thus, my recommended view of free agency is aspirational. A free agent should aspire to treat his own actions as much as possible as interventions in the world. That is, he should not conceive of his actions as caused, but should think of them as actual causes of change in the world.

Overall I should say that although I think that Menzies and Price [55] are overly enthusiastic about the possibility of reducing causation to a primitive (experiential) notion of free agency, I think that the basic idea of treating causation in terms of intervention or manipulation is definitely sound. As Menzies and Price recognize, causation is a notion with an essential interventionist component. To enquire about a cause in effect is to enquire how a modification of the cause brings about a modification of the effect; how we understand the key concept of a modification is indeed crucial to the theory of causation.

3.3.2 Prediction and explanation in Simpson's paradox

In this section I want to fill in a broader picture by linking the notions of news value and causal evidence to the affiliated notions of prediction and explanation. Broadly speaking, I will take it that the causal and the news value senses of evidential relevance track distinct aspects of a more basic and fundamental notion of informational relevance.
Information about X can be relevant to Y in two distinct ways: (a) it can be predictively relevant to Y; (b) it can be explanatorily relevant to Y. Predictive relevance is a measure of how, and how much, learning about X allows us to predict the outcome in Y. Explanatory relevance provides an idea of how well the information about X explains the outcome in Y. Predictive relevance is compatible with Simpson's paradox. Explanatory relevance isn't. The point hinges on the knowledge representation of the partition factor Z. It is compatible with the assertion that, in the absence of knowledge about Z, information about X predictively allows us to infer a certain value y for Y, that, knowing the value of Z, the same information about X permits the inference of a totally different value y∗ for Y. Predictive inference is sensitive to knowledge. But explanatory inference is not. If the facts about X are capable of explaining the facts about Y under conditions of ignorance about Z, they are capable of explaining them under conditions of knowledge as well (and conversely).

The point can be illustrated in terms of the earlier example from Skyrms. Consider two causally identical models, both centred on John, the subject of our inquiry. In both of these John smokes, lives in the country, and dies from cancer. But in one model we know his location, and in the other model we don't. Is it permissible to claim that smoking explains John's cancer in one model but not in the other? Clearly not. Explanatory power is tied to causal structure, not to our knowledge of factor Z (i.e. location). By contrast, the predictive power of the information that John smokes is highly dependent on the state of knowledge. Estimating predictive risks of cancer in a model enriched by knowledge of Z provides lower and upper bounds on the predictive risk of cancer in the model without knowledge of Z.
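This bounding fact is just the law of total probability: the predictive risk without knowledge of Z is a weighted average of the within-Z risks, so it must fall between their minimum and maximum. A minimal sketch, with hypothetical figures of mine rather than numbers from the text:

```python
# Hypothetical within-location risks and location weights (mine):
risk_given_z = {"country": 0.15, "city": 0.45}      # pr(cancer | smoke, z)
weight_given_smoke = {"country": 0.8, "city": 0.2}  # pr(z | smoke)

# Law of total probability: a convex combination of the within-Z risks.
risk_unknown_z = sum(weight_given_smoke[z] * risk_given_z[z]
                     for z in risk_given_z)         # ~0.21

# Hence the location-blind risk is bracketed by the location-known risks.
assert min(risk_given_z.values()) <= risk_unknown_z <= max(risk_given_z.values())
```

On these figures the risk assessed without location knowledge (~0.21) sits above the country-known risk and below the city-known risk, which is the ordering of insurance rates described next.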
For example, insurance rates for John's life should be lower in the model in which we don't know his location than in a model in which we know he lives in the city, and yet higher than in the actual model in which we know he lives in the country. There is thus a clear distinction between the distinct roles of predictive and explanatory processing of information in terms of how well they can accommodate Simpson's paradox.

A relevant question to ask in this context is as follows. Exactly how does the structure of causation interact with the structure of empirical explanation? The literature on scientific explanation shows that the aims of a theory of explanation are well served by appealing to causal knowledge. For example, both the phenomenon of explanatory asymmetry53 and the problem of explanatory irrelevance54 are adequately dealt with by referring to the directionality of causation. Whatever the complexities in such a question may be, we may take the following principles as basic to the interplay between causal knowledge and explanatory structure: (directionality) the direction of causation determines the direction of explanation; and (limits) causal irrelevance entails explanatory irrelevance. By the first principle of directionality, if c causes e, then facts about c are capable of explaining facts about e, but not conversely. By (limits), if c is causally irrelevant to e, then in explaining e it is irrelevant to invoke facts about c. Together these two principles express the idea that causal knowledge places strict structural constraints on explanation: essentially, causal knowledge determines which variables are explanatorily relevant to which, and which are irrelevant altogether for the purpose of explanation.

53 See for example van Fraassen [80], Salmon [67], and Salmon [69].
54 See for example Salmon [66] and Kyburg [42].

If the previous remarks are correct, then a new layer of the distinction for which I have argued in this chapter has come to light.
We may take it that every Simpson's paradoxical causal representation implicitly determines a highly structured, strict network of explanatory relevance, which is entirely distinct from the news value or predictive representation of evidential relevance55. Causal and news value evidential relevance allow us to negotiate the structures accordingly. This is yet another reason to take seriously the idea that a theory of evidence ought to make room for both.

3.4 Conclusion

Causation and evidence are concepts with fundamental ties to probability. The consequences of Simpson's paradox for the probabilistic notion of causation have been extensively discussed. The range and variety of proposals to modify the simple theory of probabilistic causation proposed by Cartwright, Skyrms, Eells, Otte, and many others bear witness to the interest that Simpson's paradox has evoked, and the seriousness with which it has been treated in the philosophy of causality. In this chapter, I argued that the Bayesian theory of evidence, which to date remains the most popular probabilistic treatment of evidence, needs to come to terms with Simpson's paradox.

55 There is on the whole very little discussion of Simpson's paradox in the context of scientific explanation. van Fraassen [80] briefly discusses Simpson's paradox in relation to "probability-lowering" explanations. See van Fraassen [80], pages 148-150. Describing a similar model, he claims that a proposition "Tom smokes" can favour the topic "Tom has heart disease" in a "straightforward (though derivative) sense: for as we would say it, the odds of heart disease increases with smoking regardless of whether he is an exerciser or a non-exerciser, and he must be one or the other" (page 149). The derivative but straightforward favouring relation is what I have explicitly identified here as causal.

Chapter 4

Simpson's Paradox Unified

4.1 Introduction

Is Simpson's paradox a unified paradox?
Or is there a distinct non-causal form of paradox, pertaining to reasoning about ratios, as opposed to reasoning about causation? To anyone who studies the topic, it is abundantly clear that Simpson's paradox involves a form of inference gone wrong. An instance of Simpson's paradox is an instance of apparently sound reasoning leading to a counterintuitive conclusion. But to what extent is Simpson's paradox a form of counterintuitive inference arising for reasoning about proportionality as opposed to reasoning about causation? According to the critics of the causal view, it is either the case that (a) Simpson's paradox arises primarily for reasoning about ratios or (b) Simpson's paradox arises equally for reasoning about ratios and for reasoning about causation, but in different forms. If these views are correct, it seems that we should talk about different varieties of Simpson's paradox, or multiple forms of paradox, as opposed to a single unified form to make sense of. Something must be said to address the threat of disunity in the genus.

Here is an example attributed to the mathematician John G. Bennett56, which draws attention to the fact that when ratios are aggregated our natural expectations may be aggrieved:

56 Bandyopadhyay et al. [5]

Suppose we have two bags of marbles, all of which are either big or small, or red or blue. Suppose that in each bag, the proportion of big marbles that are red is bigger than the proportion of small marbles that are red. Now suppose we pour all the marbles from both bags into one box. Would we expect the proportion of big marbles in the box that are red to be greater than the proportion of small marbles that are red? Most of us would be surprised to find that our usual expectation is incorrect. The big balls in bag one have a higher ratio of red to blue balls than do the small balls; the same is true about the ratio in bag two.
But considering all the balls together, the small balls have a higher ratio of reds to blues than the big balls do. (5, page 194)

Those who follow Bennett in assigning the source of our perplexity when dealing with Simpson's paradox to our apparent inability to reason correctly on the basis of assumptions involving proportions are likely to see this type of example as a test case for the persuasiveness and cogency of the causal account. For example, Bandyopadhyay et al. [5] write:

The premises that generate the paradox are non-causal in character and a genuine logical inconsistency is at stake when a full reconstruction of the paradox is carried out. (5, page 186)

We argue that this is a case of SP since it has the same mathematical structure as the type I version of Simpson's paradox. There are no causal assumptions made in this case, no possible "confounding". But it still seems surprising. This is the point of the test case. The proponents of the causal analysis of the paradox must argue either that this is not surprising or that it engages in causal reasoning even when the question presents us with nothing causal. We find neither of these replies tenable. We believe the test case shows that at least sometimes there is a purely mathematical mistake about ratios that people customarily make. (5, page 194)

Correlations are not causes, though correlations are (part of the) evidence for causes. But what is paradoxical in the SP case has little to do with these complexities; there is simply a mistaken inference about correlations, which are really just ratios. (5, page 195)

Wherever they arise, paradoxes create a lingering sense of puzzlement. But the art of paradox involves being able to recognize and avoid unnecessary puzzlement, in addition to understanding the nature of the paradox. Conceptual economy too dictates that we should avoid multiplying forms of paradox unnecessarily. There is something inherently otiose in the very idea of multiple forms of paradox.
But if it is true that Simpson's paradox arises separately for reasoning about ratios, or any other type of non-causal subject matter, then the various causal solutions that I proposed in this thesis don't really seem like solutions at all. A solution needs to have some intrinsic level of generality. But if the diversity of Simpson's paradoxical phenomena forces us to distinguish between contexts in which causal solutions are appropriate, and contexts in which causal solutions are not appropriate, then the causal approach may seem ad hoc, even terminally ad hoc. Or it may seem that it misses the mark altogether, for if vestiges of paradox remain when causal structure and causal knowledge are taken out of the account, then the critic of the causal view is free to conclude more or less directly that the core of the paradox is simply not causal in any shape or form. Indeed, the view that at the heart of Simpson's paradox lies not reasoning about causation but reasoning about ratios is embraced by a variety of authors57. If the ratio view is correct, then it is a mistake to think that causation holds the key to Simpson's paradox. The primary focus ought to be on non-causal forms of reasoning, with causal forms coming in a distant second.

57 See Malinas and Bigelow [52] for a similar view, concluding that ultimately it is an empirical question why people find Simpson's paradox puzzling.

To address this objection, in this chapter I will draw a distinction between causal structure and causal reasoning. I will argue that even in scenarios in which the formulation of the problem does not address a specific causal structure, it is possible to reason about the problem as though it did have a specific causal structure behind it. I call this causal reasoning in disguise. I think that it is possible to reason about apparently non-causal subject matter in ways which simulate causal thinking.
As I will explain, it is possible to view future knowledge, for example, as knowledge that arises under an intervention. This is tantamount to treating the model as if it were causal, by breaking the probabilistic dependence between variables as though we were in fact intervening to generate causal conclusions. But if we reason about future knowledge by treating it essentially as a form of knowledge arising by intervention, it isn't surprising that paradox beckons anew. It's to be expected. That is what the causal account predicts.

4.2 Simpson's paradox as a paradox about ratios

Proponents of the causal view of the nature of Simpson's paradox understand causal knowledge and reasoning about causation to be intimately connected. The central claim in the causal account of the paradox is that causal knowledge serves as a means of separating admissible from inadmissible (good and bad) inferences pertaining to causal structure, when the information is Simpson's paradoxical at the level of pure probabilities. Primarily, causal knowledge serves as a constraint on legitimate causal inference carried out on the basis of Simpson's paradoxical probabilities. For instance, causal knowledge explains why it is legitimate to infer from (1) drug promotes recovery in men and (2) drug promotes recovery in women to (3) drug promotes recovery in the population as a whole, despite the fact that the probability of recovery in the population as a whole may be lower if the drug is administered than if it is withheld. Causal knowledge also explains why it is not legitimate to infer from (4) drug promotes recovery in the high blood pressure group and (5) drug promotes recovery in the low blood pressure group to (6) drug promotes recovery in the population as a whole, in a probabilistically identical scenario in which, as previously, the probability of recovery is lower if the drug is administered than if it is withheld [62].
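The asymmetry between the two inferences can be made concrete with a small sketch. The counts below are illustrative assumptions of mine, not data from [62], and the adjustment step is the standard back-door formula for a confounder rather than anything specific to the text:

```python
# Illustrative counts per subgroup z:
#   (recover_on_drug, total_on_drug, recover_no_drug, total_no_drug)
groups = {"z1": (81, 87, 234, 270),
          "z2": (192, 263, 55, 80)}

n = sum(t + u for (_, t, _, u) in groups.values())   # 700 patients in all

# Unadjusted (aggregate) recovery rates: the drug looks harmful overall.
rate_drug = sum(r for (r, _, _, _) in groups.values()) / sum(t for (_, t, _, _) in groups.values())
rate_none = sum(s for (_, _, s, _) in groups.values()) / sum(u for (_, _, _, u) in groups.values())
assert rate_drug < rate_none

# Reading 1: Z is a confounder (e.g. sex).  Adjust for Z (back-door formula):
#   pr(recovery | do(drug)) = sum_z pr(recovery | drug, z) * pr(z)
adj_drug = sum((r / t) * ((t + u) / n) for (r, t, _, u) in groups.values())
adj_none = sum((s / u) * ((t + u) / n) for (_, t, s, u) in groups.values())
assert adj_drug > adj_none   # adjusted for the confounder, the drug helps

# Reading 2: Z is a mediator (e.g. blood pressure, caused by the drug).
# Then one must NOT adjust: the aggregate comparison is the causally
# relevant one, and the drug comes out harmful.
```

The same probabilities thus license opposite causal conclusions depending on where Z sits in the causal structure, which is the point of the paired drug examples.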
That the first inference is valid and that the second is not accords with natural intuition. Therefore the fact that causal knowledge serves to validate one and invalidate the other demonstrates the usefulness (and felicity) of causal argument to reasoning about causation in Simpson's paradoxical scenarios. Indeed it seems that causal argument is uniquely useful here, as it is extremely difficult if not impossible to see how else (if not by means of causal argument) we could legitimize one inference and block the other.

That there should exist a separate form of paradox which affects our reasoning about proportionality seems to be a very distant if not downright contradictory afterthought in the causal account of the relevance of the paradox to formal reasoning. Yet according to the opponents of the causal view, there exists a form of counterintuitive inference that arises about proportions or ratios which is on the face of it devoid of any causal significance, hence cannot be touched by the causal account. The corrupt inferences result from confused thinking about ratios, in particular, about how ratios are aggregated. It is a simple mathematical fact that if the ratio a : a+b is greater than the ratio c : c+d, and the ratio e : e+f is greater than the ratio g : g+h, it does not follow that the ratio a+e : a+b+e+f is greater than the ratio c+g : c+d+g+h. Herein lies the significance of the paradox, according to the critics who choose to emphasize proportionality and de-emphasize causal thinking. The ratio account of Simpson's paradox proposes that the fundamental peril of Simpson's paradox involves ignoring this simple yet consequential fact about ratios.

It is worth remarking that on the ratio view, the counter-intuitiveness of Simpson's paradox is akin to the "counter-intuitiveness" of some of the inferential properties of the incremental concept of confirmation, first explored by Carnap in his Logical Foundations of Probability [1952].
See also Salmon [68]. As these writers pointed out, evidential reasoning proceeding on the basis of conditional probabilities can produce mildly surprising conclusions. For example, both conjoining and disjoining individual conditional suppositions can lead to unexpected reversals of probabilistic significance. Equally, conjoining and disjoining different propositions conditioned on the same supposition can create reversals of probabilistic inequalities. Somewhat disconcertingly, none of the following four basic inferences involving the conditional operator is valid:

(I) pr(A ∣ B) > pr(A ∣ ¬B)
    pr(A ∣ C) > pr(A ∣ ¬C);
    therefore,
    pr(A ∣ B ∧ C) > pr(A ∣ ¬(B ∧ C))

(II) pr(A ∣ B) > pr(A ∣ ¬B)
    pr(A ∣ C) > pr(A ∣ ¬C);
    therefore,
    pr(A ∣ B ∨ C) > pr(A ∣ ¬(B ∨ C))

(III) pr(A ∣ C) > pr(A ∣ ¬C)
    pr(B ∣ C) > pr(B ∣ ¬C)
    therefore,
    pr(A ∧ B ∣ C) > pr(A ∧ B ∣ ¬C)

(IV) pr(A ∣ C) > pr(A ∣ ¬C)
    pr(B ∣ C) > pr(B ∣ ¬C)
    therefore,
    pr(A ∨ B ∣ C) > pr(A ∨ B ∣ ¬C)

If the ratio view is correct, then Simpson's paradox belongs with demonstrably invalid inferences (I)-(IV). In the case of (I), for example, faulty intuitions about proportionality lead us to form the untutored expectation that if two pieces of evidence individually confirm a hypothesis, then their conjunction will also confirm it. That is not the case, as can be shown by means of a probabilistic model58. The ratio view places Simpson's paradox in the same family of problems arising purely for confirmation, and confirmation-like concepts. Next, I explain in more detail how this argument works.

4.2.1 The ratio argument

The argument from ratio inequalities is a three-step argument. One argues from (i) an indisputable fact about ratios to (ii) an indisputable fact about probabilities to (iii) the conclusion that Simpson's paradox is free of causal reasoning. The indisputable fact about ratios is the fact I highlighted above. To repeat, from (1) a/(a+b) > c/(c+d) and (2) e/(e+f) > g/(g+h) it does not follow that (3) (a+e)/(a+b+e+f) > (c+g)/(c+d+g+h).
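A concrete numerical instance of the ratio fact is easy to check. The values below are chosen for illustration (they match the marble counts used in Table 4.2 later on):

```python
# Illustrative values: each subgroup ratio favours the first group,
# yet the aggregated ratio favours the second.
a, b = 12, 18     # a/(a+b) = 0.4
c, d = 3, 7       # c/(c+d) = 0.3
e, f = 8, 2       # e/(e+f) = 0.8
g, h = 21, 9      # g/(g+h) = 0.7

assert a / (a + b) > c / (c + d)                                # (1) holds
assert e / (e + f) > g / (g + h)                                # (2) holds
assert (a + e) / (a + b + e + f) < (c + g) / (c + d + g + h)    # (3) fails: 0.5 < 0.6
```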
Thus, correct reasoning about ratios should not endorse the derivation of (3) from (1) and (2) in any context in which ratios are represented numerically. Furthermore, the indisputable fact about ratios entails an indisputable fact about probabilities, which is that if we let probabilities represent ratios, then the resulting probability function may well be Simpson's paradoxical. Suppose that probabilities are interpreted as ratios. Then we may define a pr over the variables X, Y, Z in accordance with Table 4.1 as follows. Let pr(X = x1, Y = y1, Z = z1) = a, pr(X = x1, Y = y2, Z = z1) = b, pr(X = x2, Y = y1, Z = z1) = c, pr(X = x2, Y = y2, Z = z1) = d, etc. (with the eight cell values normalized to sum to one). Then from (1) above it immediately follows that (1') pr(Y = y1 ∣ X = x1, Z = z1) > pr(Y = y1 ∣ X = x2, Z = z1); similarly from (2) it immediately follows that (2') pr(Y = y1 ∣ X = x1, Z = z2) > pr(Y = y1 ∣ X = x2, Z = z2), and yet, if we arrange it so that (3) is false, it immediately follows that (3') pr(Y = y1 ∣ X = x1) = ∑z pr(Y = y1 ∣ X = x1, Z = z) pr(Z = z ∣ X = x1) < pr(Y = y1 ∣ X = x2) = ∑z pr(Y = y1 ∣ X = x2, Z = z) pr(Z = z ∣ X = x2).

58 For a concrete model, see for instance Salmon [68] pages 14-17. Salmon counts the failure of the conditional operator . ∣ . to combine nicely with the boolean operations ∧ and ∨ both in the right and in the left argument place to be a serious worry for the incremental concept of confirmation. It is hard to agree with him about the seriousness of the worry. The worry should concern the validity of our intuitions, not the notion of confirmation. The same observation goes for the argument establishing the weaker point that the incremental evidential relation e confirms h is neither monotone nor anti-tone in either the first or the second argument position. What the counter-models show is the naivety of the intuitions, not the deficiency of the concept of incremental confirmation.

              z1                z2
           y1     y2         y1     y2      y1 ∣ z1 (%)    y1 ∣ z2 (%)    y1 (%)
x1         a      b          e      f       a/(a+b)        e/(e+f)        (a+e)/(a+b+e+f)
x2         c      d          g      h       c/(c+d)        g/(g+h)        (c+g)/(c+d+g+h)

Table 4.1: Ratio to probabilities.
In other words, representing ratios as probabilities, or interpreting probabilities as ratios, which comes to the same thing, immediately yields a pr which is Simpson's paradoxical insofar as it respects the classic SP-inequalities or so-called boundary conditions in Blyth [8].

The next step to the conclusion that Simpson's paradox can be generated merely on the basis of ratio reasoning and does not require causal knowledge is more elusive. But before we address it, it might be useful to have a concrete model. Recall the example attributed to Bennett with which we began. There is an obvious way to translate Bennett's scenario into an interpretation of the schematic data in Table 4.1. Let X refer to size, Y refer to colour, and Z refer to bag, and let the values of these variables be the obvious ones, namely x1 = large, y1 = red, z1 = bag 1, etc. For some actual numbers, take the interpreted data set in Table 4.2. It can be quickly verified that the model respects Bennett's illustration. In bag 1, the large balls predominate over the small balls in proportion three to one, a proportion which is exactly reversed in bag 2, where small balls outnumber the large to the same fractional extent. In the first bag, red balls are better distributed among the large than among the small by a ten percent difference. In the second bag, red balls are also better distributed among the large than among the small by a ten percent differential. But in the mixture the representation of red balls predominates by ten percent among the small rather than the large balls.
Contrary to what unexamined intuition might lead us to expect, the ratio difference is flipped on its head.

Here's another example that works in a similar way to Bennett's, in which the aggregation isn't the physical pooling together of entities, but the statistical pooling together of data points.

           Bag 1            Bag 2
        Red    Blue      Red    Blue     Red, Bag 1 (%)    Red, Bag 2 (%)    Mixed red (%)
Large   12     18        8      2        40                80                50
Small   3      7         21     9        30                70                60

Table 4.2: Bennett distribution.

           Study 1             Study 2
        Heads   Tails      Heads   Tails   Heads, Study 1 (%)   Heads, Study 2 (%)   Mixed heads (%)
Coin 1  12      18         8       2       40                   80                   50
Coin 2  3       7          21      9       30                   70                   60

Table 4.3: Aggregation of data.

Bias aggregation Two studies have been conducted to determine the biases of a pair of coins, Coin 1 and Coin 2. The first study focussed on Coin 1, tossing substantially more of Coin 1 in proportion to Coin 2. The opposite was true of the second study. Their individual findings were then tabled separately and also aggregately, as in Table 4.3. It would appear that according to the first study, Coin 1 is more likely to produce a Heads outcome than Coin 2. The same is true of the second study. Its findings also show that Coin 1 is more likely to land Heads than Coin 2. But if we aggregate their data points, we conclude that on the whole, the total information shows that Coin 2 is more likely to land Heads than Coin 1. The bias to heads conclusion is reversed when the data from one study is joined to the data from the other.

In both the box example and the bias example, "natural" expectation about how ratios ought to be maintained across aggregation is violated. Would anybody take this violation of intuition to be the paradox in Simpson's paradox? Some authors do seem to think that to generate an enduring sense of paradox it is sufficient to violate expectations about how ratios aggregate. But I doubt that they are correct. Note that the response to the violation of the intuition about aggregation of ratios is to give up the intuition without further ado.
But if the response to the violation of the intuition is simply to give up the intuition, several things may be called into question. First in line is the validity of the intuition. The fact that we are happy to retract the intuition as soon as trouble arises shows the intuition to have been superficial. Then, there is the question of how deep paradox really runs. For a genuine paradox to arise, a genuine intuition must appear to be violated. Since that's far from being the case here, one can take the real point to be that there is no separate form of Simpson's paradox arising just for reasoning about ratios, merely the appearance of one.

But perhaps that's giving the argument short shrift. To give the argument from ratio aggregation a better foothold, suppose we set aside these considerations and agree that (a) there is no causal structure in the mixed bags scenario and (b) it's a moot point that a genuine intuition has been violated. For now, I want to elaborate Bennett's example to give it as much scope as possible for generating a sense of "paradox", or of intuition under fire. Indeed if we broaden our discussion we may find that the intuitions under fire are themselves more complex and broad than just the intuition about aggregation of ratios. Consider Bennett's illustration as a process rather than a static scenario, involving a rational agent reasoning about likelihood on the basis of the data available to him. Imagine that:

Dynamic Bennett A ball has been drawn at random from the mixture of balls in the box in accordance with the distribution in Table 4.2. Let pr represent rational degrees of belief, as opposed to objective probabilities, and suppose that we think of a time t1 and a time t2 as before and after the agent comes to know whether in fact the ball came from bag one or bag two.
We may assume that it is possible to trace back original provenance in some way or another, for example by means of a tag or mark showing if it came from bag one or two, and that this information is both revealed to the agent as a matter of necessity at the interim moment, and expected to be revealed by the agent at the earlier time, though of course the agent doesn't know the contents of what will be revealed to him in due course.

Suppose that before the coming-to-know moment (the moment of truth) the agent compares his conditional probabilities and discovers that he thinks it is more likely that the ball is red if it is small than if it is large. A quick glance at his subjective probabilities shows that he is entitled to do so, since his probabilities are such that pr(red ∣ small) = 0.6, which is greater than his degree of confidence pr(red ∣ large) = 0.5. But suppose that at t1 the agent is also considering what his future opinion will be, after the moment of truth, that is at time t2. Assuming he learns nothing else between t1 and t2, if at the interim moment he learns that the ball came from bag one, at t2 he will think that the ball is more likely to be red if it is large rather than small. Similarly, if at the interim moment he learns that the ball came from bag two, he will also think that the ball is more likely to be red if it is large than if it is small. There is thus a perfect agreement between his future selves at t2 as to the comparative significance of size to colour, and moreover a perfect disagreement between the future evaluative opinion and his current evaluative opinion.

Isn't this Simpson's paradox? Doesn't this show that we can generate counterintuitive consequences simply on the basis (essentially) of ratios and proportionality? The way we extended Bennett's example in terms of a mild informational dynamics doesn't seem to contribute causal structure to the problem. So if the original formulation was free of causal structure, so is the dynamic version.
But we’ve ended up with something that looks very muchlike paradox, albeit concerning comparative or conditional opinion.Note that the agent’s current ranking of conditional probabilities is notexpected to survive the increase in information represented by the comingto know moment. On the contrary, the agent fully expects that his futureranking after he comes to know will reverse of his current ranking. Onthe face of it this is counterintuitive enough, but we can make it morecounterintuitive by reflecting on the quality of the epistemic position beforeand after the moment of coming to know, and what this entails about thedifference of opinion. It seems that the agent at t1 is less knowledgeablethan his future selves at t2, no matter which one of these is his actual future90self 59. But the comparatively more ignorant agent at t1 knows the opinionof the more knowledgeable selves at t2. In general it seems irrational fora more ignorant agent to refuse to change his opinion to accord with theopinion of the better informed agent, when it is common knowledge whatthe better opinion is. We tend to think that better informed opinion hasepistemic authority over less well informed opinion, and that refusing toacknowledge epistemic authority is a kind of irrationality. But if the agentforms his evaluations according to his current opinion, then he must ignorethe argument from authority. Paradox appears to cut deep here, becauseit seems to force the agent into a type of irrationality that we would notnormally condone60.One might respond that the time t1 agent thinks it more likely that theball is red if it is small rather than large, because he has failed to includethe information that he will think the opposite at t2. If at t1 he expectsthat he will change his mind at t2, the agent must change his mind now.If he does not, it must be because he is ignoring the information that hewill change his mind. 
For no agent can rationally hold an opinion which is known to diverge from his expected opinion. To do so would violate a form of the famous belief reflection principle, which requires the agent to reach an equilibrium between what he currently thinks and what he expects he will think at a later time.

Unfortunately, appealing to belief reflection will not help defuse the paradox. For belief reflection applies to unconditional probabilities, not to conditional probabilities. Belief reflection in the form due to van Fraassen requires the agent to set his unconditional probabilities to be the expectation of his unconditional probabilities at a later time.

59 We shall ignore here a complexity which arises for systems of modal logic which represent the logic of the modal operator as S5. In S5 it is impossible for a single agent to become more knowledgeable after acquiring new information without also becoming less knowledgeable as well. For by negative introspection, an agent who doesn't know whether p knows that he doesn't know whether p. But it is impossible for an agent who does know whether p to know that he doesn't know whether p, for the latter proposition is false, and knowledge is factive. Hence there is something that the agent who doesn't know whether p knows, that he ceases to know once he learns whether p, namely a proposition which expresses his own ignorance [see 70, 71].

60 See Arntzenius [3] and Joyce [39].
But it does not require the agent to set his conditional probabilities to be the expectation of his conditional probability function at the later time.

In more detail, van Fraassen [81, page 244] proposes that (Reflection):

(Reflection) P^a_t(A ∣ P^a_{t+x}(A) = r) = r,

is a principle of rationality, where "P^a_t" is the agent a's credence function at time t, x reflects the magnitude of the time lapse between t and the hypothetical future, and "P^a_{t+x}(A) = r" is the proposition that at time t + x the agent gives degree r of credence to proposition A.

If we assume that the propositions of the form "P^a_{t+x}(A) = r" form a partition at t + x, then by (Reflection) and probability calculus we can derive:

P^a_t(A) = ∑r P^a_t(A ∣ P^a_{t+x}(A) = r) × P^a_t(P^a_{t+x}(A) = r)
         = ∑r r × P^a_t(P^a_{t+x}(A) = r)
         = E_t P^a_{t+x}(A)

This tells us that the agent's time t unconditional probability for A is the expectation of his time t + x unconditional probability for A. Note that the principle of reflection is used in step two of the derivation above.

But although the principle constrains one's unconditional probabilities to be the expectation of future unconditional probabilities, one's conditional probability of A given B, namely,

P^a_t(A ∣ B) = ∑r P^a_t(A ∣ P^a_{t+x}(A) = r ∧ B) × P^a_t(P^a_{t+x}(A) = r ∣ B)

is not analogously constrained to be the expectation of future conditional probability.

Why is that?
Note that unless P^a_t(P^a_{t+x}(A) = r ∣ B) = P^a_t(P^a_{t+x}(A) = r), i.e. unless the partition by future values for the probability of A is independent of B, the conditional probability of A given B will not in general be equivalent to one's expected future probability of A conditional on B, in other words to:

∑r P^a_t(A ∣ P^a_{t+x}(A) = r ∧ B) × P^a_t(P^a_{t+x}(A) = r) = E_t P^a_{t+x}(A ∣ B).

Returning to our concrete example, it can be easily checked that the expectation of the future probability of red given small is:

E^a_{t1} P^a_{t2}(red ∣ small) = 1/2 × 3/10 + 1/2 × 7/10 = 0.5;

whilst the conditional probability of red given small is:

P^a_{t1}(red ∣ small) = 1/4 × 3/10 + 3/4 × 7/10 = 24/40 = 0.6.

On the other hand, the expectation of the future probability of red given large is:

E^a_{t1} P^a_{t2}(red ∣ large) = 1/2 × 4/10 + 1/2 × 8/10 = 0.6

whilst the conditional probability of red given large is:

P^a_{t1}(red ∣ large) = 3/4 × 4/10 + 1/4 × 8/10 = 0.5

Note that:

(1) E^a_{t1} P^a_{t2}(red ∣ small) < E^a_{t1} P^a_{t2}(red ∣ large)

Thus if the agent considers his future evaluations (opinions about evidential force) by taking into account the expectation of his future probability conditional on small, he will come to the conclusion that red is expected to be more likely given large than given small. But unfortunately, this does not change the fact that:

(2) P^a_{t1}(red ∣ small) > P^a_{t1}(red ∣ large)

That is, on his current opinion, he thinks that red is more likely given small than given large. But belief reflection does not rule that the expectation should replace the actual judgement. Belief reflection does not defuse the problem here, because it does not require an alignment between conditional probability, and the expectation of conditional probability. It only requires the alignment where unconditional probability is concerned. Unfortunately, the puzzle here seems to stem from the structure of conditional probabilities, not from unconditional probabilities. There is nothing in the problem discussed so far that violates reflection.
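These computations can be verified mechanically from the Table 4.2 counts; the following is a sketch using exact rational arithmetic (the function names are mine, not the text's):

```python
from fractions import Fraction as F

# Table 4.2 counts: (red, blue) for each (bag, size) cell.
counts = {("bag1", "large"): (12, 18), ("bag1", "small"): (3, 7),
          ("bag2", "large"): (8, 2),   ("bag2", "small"): (21, 9)}
bags = ("bag1", "bag2")

def p_red(bag, size):
    # pr(red | size, bag) within one bag
    red, blue = counts[(bag, size)]
    return F(red, red + blue)

def p_bag(bag, size):
    # pr(bag | size): weight of each bag within one size class of the mixture
    totals = {b: sum(counts[(b, size)]) for b in bags}
    return F(totals[bag], totals["bag1"] + totals["bag2"])

# Current conditional opinion at t1, mixing by pr(bag | size):
p_small = sum(p_bag(b, "small") * p_red(b, "small") for b in bags)
p_large = sum(p_bag(b, "large") * p_red(b, "large") for b in bags)

# Expected future conditional opinion, mixing by pr(bag) = 1/2:
e_small = sum(F(1, 2) * p_red(b, "small") for b in bags)
e_large = sum(F(1, 2) * p_red(b, "large") for b in bags)

print(p_small, p_large)   # 3/5 1/2  -> current opinion favours small
print(e_small, e_large)   # 1/2 3/5  -> expected future opinion favours large
```

Inequalities (1) and (2) fall out directly: the expectations rank large above small, while the current conditional probabilities rank small above large.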
Thus we may reject the suggestion that appealing to belief reflection solves the problem.

The agent who thinks that red is more likely given small than given large, yet knows or fully expects that if he knew better, he would think that red is more likely given large than given small, does not violate reflection, but does seem to violate the principle of epistemic authority or reasoning from better knowledge. This apparent sort of violation seems substantive enough to be taken seriously. Is this a non-causal version of Simpson's paradox? In the following section, I will argue that it is not, because the reasoning that led to this position is tacitly causal. It treats the future knowledge as a type of intervention in the system. It is causal reasoning in disguise.

4.2.2 Future knowledge as knowledge by intervention

I grant that Bennett's example does not require us to address causal structure. We do not need to think about the causal relations obtaining between the three properties of size, colour, and bag. We do not need to postulate that size is a cause of colour, or that colour is a cause of size, or that bag causes them both, even if it were meaningful to think of these properties as being part of a causal network, which isn't immediately obvious either. I think we understand the problem without making causal assumptions or postulating genuine causal relations to bolster the model. But I don't think that it follows from this that the reasoning we engage in is not causal, if only in the sense of simulating genuine causal thinking. On the contrary, I think the argument I gave above on behalf of Bennett's ratio observations ends up looking a lot like causal thinking (I will sharpen this observation below). And so if there is a paradox, it's because we've engaged in causal reasoning, not because the subject matter involves ratios.

The first point I want to make is about the role played by the assumption that size and bag are probabilistically dependent.
As it turns out, this assumption is crucial to the entire argument. Without assuming that the respective factors are probabilistically dependent, we cannot generate Simpson's paradox.

The point applies to Simpson's paradox quite generally, so it is worth running quickly through the argument to see how it works. Take (P1) and (P2) below. Narrowly construed as a phenomenon about probabilistic inequalities, Simpson's paradox tells us that (P1) and (P2) do not entail the conclusion, that is, that the following form of argument is invalid.

Canonically invalid pattern of inference involving probability inequalities:

(P1) pr(A ∣ B,C) > pr(A ∣ ¬B,C)
(P2) pr(A ∣ B,¬C) > pr(A ∣ ¬B,¬C)

therefore,

(Conclusion) pr(A ∣ B) > pr(A ∣ ¬B).

If we translate this into statements about comparative probability, we can say that from (i) given the assumption C, A is more likely if B than if not B, and (ii) given the assumption that ¬C, A is more likely if B than if not B, we cannot infer that (iii) A is more likely if B than if not B. But note that (P1) and (P2) fail to entail the conclusion of the inference only if we assume that B and C are probabilistically dependent. Otherwise the conclusion has to follow from the premises.

Proof. The first step is to express the relevant conditional probabilities as weighted averages:

pr(A ∣ B) = pr(A ∣ B,C) × pr(C ∣ B) + pr(A ∣ B,¬C) × pr(¬C ∣ B)
pr(A ∣ ¬B) = pr(A ∣ ¬B,C) × pr(C ∣ ¬B) + pr(A ∣ ¬B,¬C) × pr(¬C ∣ ¬B)

Suppose that C is independent of B.
Note that if C is independent of B, then we immediately obtain:

pr(C ∣ B) = pr(C ∣ ¬B)
pr(¬C ∣ B) = pr(¬C ∣ ¬B).

Next, by letting pr(C ∣ B) = x, from which it follows that pr(¬C ∣ B) = 1 − x, and substituting in, we obtain:

pr(A ∣ B) = pr(A ∣ B,C) × x + pr(A ∣ B,¬C) × (1 − x)
pr(A ∣ ¬B) = pr(A ∣ ¬B,C) × x + pr(A ∣ ¬B,¬C) × (1 − x)

But if pr(A ∣ B,C) > pr(A ∣ ¬B,C) and pr(A ∣ B,¬C) > pr(A ∣ ¬B,¬C) are both true, then pr(A ∣ B) > pr(A ∣ ¬B) is true as well, given that the expressions flanking each inequality sign are multiplied by the same weight. Thus we have shown that under conditions of probabilistic independence between B and C, we cannot generate an appropriate counterexample, and the argument from (P1) and (P2) to the conclusion is valid.

Return now to the argument from future knowledge. Recall that between t1 and t2 the agent learns whether the ball was drawn from the large bag or the small bag. The argument runs as follows:

No matter what the agent learns in the interim, at t2 he will think red more likely given large than given small. Therefore, at t1 he fully expects that at t2 he will think that red is more likely given large than given small. Hence, at t1 he fully expects that at t2 his future opinion will contradict his present opinion concerning whether red is more likely given large or given small.

But how exactly does this argument represent the probabilistic relationship between size and the partitional factor bag? In actual fact, size and bag are related by a relation of probabilistic dependence, which, according to the previous argument, is a necessary condition for generating the Simpson's paradoxical inequalities. But does this argument really represent size and bag as probabilistically dependent, or is the dependence tacitly severed?

We know that severing the probabilistic connection between two variables corresponds to an intervention, in a purely technical sense.
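The role of the dependence can also be illustrated numerically. The sketch below (an illustration, not a substitute for the proof; the variable names are mine) draws random conditional probabilities satisfying (P1) and (P2), severs the connection between B and C by forcing pr(C ∣ B) = pr(C ∣ ¬B) = x, and confirms that the conclusion then always follows.

```python
# Illustration of the proof above: with C independent of B,
# premises (P1) and (P2) force the conclusion pr(A|B) > pr(A|¬B).
import random

random.seed(0)
for _ in range(10_000):
    x = random.random()  # pr(C), the same whether or not B holds
    # Draw cell probabilities so that (P1) and (P2) hold by construction.
    a_bc, a_nbc = sorted((random.random(), random.random()), reverse=True)
    a_bnc, a_nbnc = sorted((random.random(), random.random()), reverse=True)
    # Weighted averages with a common weight x, as in the proof.
    p_a_b = a_bc * x + a_bnc * (1 - x)      # pr(A | B)
    p_a_nb = a_nbc * x + a_nbnc * (1 - x)   # pr(A | ¬B)
    assert p_a_b >= p_a_nb                  # no reversal is possible
print("no reversal found in 10,000 independent-C trials")
```

To generate a Simpson's reversal, the loop would have to use different weights pr(C ∣ B) ≠ pr(C ∣ ¬B) in the two averages; that is exactly the dependence the proof shows to be necessary.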
If it is not severed, then we expect the assumption of size to have an updating effect on the probability of bag. That is simply because if we assume that bag and size are probabilistically dependent, it is incoherent to make an assumption about one without updating the probability of the other.

So suppose that at t1 the agent assumes that the ball is large. In light of this assumed information, the probability that it came from bag one is 3/4. The agent knows that he will come to know that it came from bag one if and only if it is true that it came from bag one. So his t1 probability that he will come to know this fact is 3/4, and his t1 probability that he will come to know that it came from the other bag is 1/4. If on the contrary he assumes that the ball is small, then the probabilities of coming to know that the ball came from bag one or two are 1/4 and 3/4 respectively. On the former assumption, his expectation for his t2 probability for red is 3/4 × 0.4 + 1/4 × 0.8 = 0.5. On the latter, his expectation for his t2 probability for red is 1/4 × 0.3 + 3/4 × 0.7 = 0.6. So no matter what he learns before t2, it seems that for the agent red is expected to be more likely on the assumption that its size is small rather than large. This is of course the opposite conclusion to what the argument from future knowledge intends us to think.

The assumption that bag and size are probabilistically dependent means that the conditions or assumptions about size are treated in a particular way. Size is treated as informationally relevant concerning the various probabilities of coming to know facts about the partition. Now suppose that instead of thinking about coming to know facts about the partition as being an event with a probability dependent upon size, we think of it as arising by an intervention. We just “surgically” make it true that the agent knows the facts about the partition factor by time t2.
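The contrast between the two computations can be made explicit in a short sketch (the function name is mine; the figures are those just used). The only difference between conditioning and intervening is which weights attach to the two possible items of future knowledge:

```python
# Two ways of computing the agent's expected t2 probability for red.
def expectation(weights, t2_probs):
    """Weighted average of the possible t2 probabilities for red."""
    return sum(w * p for w, p in zip(weights, t2_probs))

# Conditioning: the size assumption updates the probability of what
# will be learned, so the expectation uses the posterior bag weights.
cond_large = expectation([0.75, 0.25], [0.4, 0.8])   # 0.5
cond_small = expectation([0.25, 0.75], [0.3, 0.7])   # 0.6

# Intervention: the learning is fixed "surgically", severing its
# dependence on size, so the unconditional weights (1/2, 1/2) remain.
int_large = expectation([0.5, 0.5], [0.4, 0.8])      # 0.6
int_small = expectation([0.5, 0.5], [0.3, 0.7])      # 0.5

print(cond_small > cond_large)   # conditioning: red likelier given small
print(int_large > int_small)     # intervening: red likelier given large
```

The two stances deliver opposite comparative rankings from the same ingredients, which is the dissonance diagnosed in the text.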
Then the probability that the agent comes to know, for example, that the ball came from bag one depends on nothing else (not even size) except the act of intervening to make it true that the agent knows this proposition. If we surgically make it true that the agent knows the facts, there is no need to update probabilities in light of making an assumption about size. The dependence relation has been cut. Then indeed we can say that no matter what the agent learns by intervention before t2, he is expected to think red more likely given large than given small. Unlike the dependence reasoning, we do not export the size condition across the learning modality to update probabilities for coming to know at the earlier t1.

It would appear that where the argument from future knowledge goes wrong is by mixing up two different ways of thinking. When the agent considers his current opinion about comparative conditional rankings, he is assuming that size and bag are dependent on one another. But when he considers his future opinion in light of what he may come to know, he treats the knowledge he will come to have as though it arose by an intervention that makes it true that he knows one fact rather than another by fiat or force. It's not surprising that conflict or dissonance arises. We cannot expect the conclusions of two different ways of reasoning with probabilities, namely under an intervention and under an act of conditioning, to necessarily agree with each other. That won't be the case in general, and it isn't the case in the scenario under discussion.

4.2.3 Causal reasoning in disguise

Above I distinguished between two ways of thinking about future knowledge, which I will call reasoning from probabilistic dependence and reasoning from knowledge as intervention. To recapitulate:

Reasoning from probabilistic dependence

If the agent assumes that the ball is large, then his probability that it is red is 0.5. If he assumes that the ball is small, his probability that it is red is 0.6.
Thus on his current probabilities, red is more probable on the assumption that it is small rather than large. On the other hand, if he assumes that the ball is large, that means that it is three times more likely that the ball came from the first bag rather than the second. The opposite is true if he assumes that the ball is small. Therefore what he expects to learn in the future has a certain probability that depends on the assumption about size. Working this through, on the first assumption, his expected probability for red at t2 no matter what is learned in the interim is 0.5. Whereas on the second, his expected probability for red at t2 no matter what is learned is 0.6. Thus on his current probabilities red is expected to be more probable given small rather than large no matter what is learned in the future. Future and present opinion agree.

Reasoning from knowledge as intervention

We take the learning modality no matter what is learned to refer not to ordinary knowledge but to knowledge arising by intervention. In this sense no matter what is learned means no matter what is learned by intervention. If we make it true by intervening that the agent knows that the ball came from the first bag, then post-intervention red is more likely given large than given small. Also if we make it true by intervening that the agent knows that the ball came from the second bag, then post-intervention red is more likely given large than given small. If by t2 one or another intervention is assumed to have taken place, then by t2 he is expected to have an opinion which differs from his present opinion. Future and present opinion disagree. But crucially, they disagree because his future opinion is elicited under assumptions appropriate to the operation of intervening, which severs the actual dependence between size and bag.
There are no probabilities for coming to learn one fact about the partition rather than another, because the knowledge, construed as knowledge by intervention, depends only on the act of intervening.

Note the different ways in which reasoning from probabilistic dependence and reasoning by intervention treat the scope of the learning modality no matter what is learned in relation to the scope of the assumption about size. On the probabilistic dependence reasoning, we first update the probability for learning one thing or another in the future, and then compute an expectation with appropriate weights reflecting the updated probabilities, to give us what the agent expects to think no matter what is learned. This gives the assumption about size scope over the learning modality. But if we treat the knowledge as arising by intervention, the learning modality has scope over and above the assumptions about size.

So what is causal reasoning in disguise? It is reasoning which revokes assumptions of probabilistic dependence, to generate conclusions which are warranted only if those assumptions are revoked. The reasoning simulates causation because it involves taking a stance which allows us to control the value of a variable, holding it fixed not by finding out that it is true and fixing it that way, but by making it true by force. If we treat the knowledge of the partition factor as intervention-dependent, we in effect sever the existing dependencies between the partition factor and the other factor of interest, namely size. That allows us a certain freedom that we don't have when we reason by treating the assumption of size as something that we need to condition on.

To close the discussion, I note that one might object on principle that it doesn't make any sense to speak of intervention in contexts free of causation, because intervention is an operation defined for causal networks. But this is wrong. There is of course a causal interpretation of the operation of intervention in causal contexts.
But the operation itself is defined relative to an ordering of variables, which we are free to interpret as we like. We may interpret “X comes before Y” in the ordering as “X is a causal ancestor of Y”, but we do not have to. Intervention is essentially an operation which is capable of discerning more structure than ordinary conditioning; the same joint probability function may be factorized in different ways, and conditioning won't tell the difference, but intervention will. Thus when we reason by intervention in contexts free of causation, we reason on the basis of more structure than is there from the point of view of conditioning. That's the long and the short of it, at least from a technical point of view.

4.3 Conclusion

To count as a distinct non-causal form of paradox, Simpson's paradox for ratios must involve a distinct non-causal intuition. I noted in this chapter that the intuition about ratio combination does not seem substantive enough to be the foundation of the claim that there is a distinct form of paradox arising just for reasoning about proportionality. But even if we bracket this concern, the more robust argument from future knowledge still does not establish the desired conclusion, despite appearances, because to make it work we have to engage in a mode of reasoning which simulates causal reasoning, without actually dealing with causal matter. In one way or another, the deeply puzzling aspects of Simpson's paradox refer us back to causal reasoning, and the distinction between reasoning on the basis of an intervention versus conditioning. I do not think that the argument from applications of Simpson's paradox to non-causal contexts threatens the view that there ought to be a unified, causal approach to Simpson's paradox.

Chapter 5

Simpson's Paradox and Counterfactual Projection

5.1 Introduction

I argued in chapter 3 that Simpson's paradox forces us to distinguish between a causal and a news value sense of evidence.
There is neither a unique concept of evidential relevance, nor a unique direction of evidential force, that can fully account for the problem that Simpson's paradox poses for the concept of evidence. However, a single piece of evidence can play two distinct roles (news value and causal), yielding two distinct but compatible senses in which, for example, the information that John smokes can be evidence against as well as for the hypothesis that John has cancer, depending on whether or not we intervene to keep fixed John's actual location.

I noted in the earlier chapter that Simpson's paradox has generally failed to resonate in the literature on epistemology. Only a limited discussion of these issues exists. Nevertheless, there does exist an alternative to the view I defend here, which treats the evidential relation not as divided between a causal and a news value relevance, but as unreliable. In this chapter, I will articulate and address a version of the alternative view enunciated in Morton [57], according to which the unreliability of Simpson's paradoxical evidence consists in its failure to project counterfactually. The probabilistic correlation between e and h is said to project counterfactually if and only if had e been false, h would have been less likely to obtain (if the correlation is positive); or, if and only if had e been false, h would have been more likely to obtain (if the correlation is negative). Thus for example the actual negative evidential relation obtaining between smoking and cancer fails to project counterfactually if it is true that had John not smoked, he would have been less likely than he is in actual fact to contract cancer. On Morton's theory, this counterfactual is true due to the non-backtracking way we interpret the counterfactual. On the semantics for interpreting non-backtracking counterfactuals, we hold the past fixed up until the time when the antecedent is supposed to obtain.
If John's location played a causal, determining role in his decision to smoke, then we may take it that if John had not smoked, he would still have occupied the same location (city or countryside) as he actually does, since the location is both causally and temporally antecedent to John's deciding to smoke. Yet, Morton argues, relative to each of these possibilities, John's failing to smoke would further decrease his chances of getting cancer, compared with what they actually are, due to the removal of an exacerbating risk factor (smoking). On the basis of this reasoning, Morton concludes that (a) John smokes is evidence against cancer, and yet (b) it is unreliable evidence insofar as the counterfactual probability of cancer on the supposition that the evidence is false is lower than it actually is.

Morton's view differs from the view defended here in two important ways. First, as noted, it attempts to describe the force of evidence using a single concept of evidence, which (as it would appear from his discussion) tracks overall probabilistic relevance and therefore completely ignores the contribution made by the causal partition to the information conveyed by the proposition that John smokes. Second, Morton is prepared to grant that belief that is based on a non-counterfactually projecting evidential relation can in principle be rational. He calls this type of belief deviant but rational belief. The rationality of belief formed on the basis of unreliably supporting evidence is supposed to reflect the fact that it is not entirely unresponsive to (at least) part of the probabilistic information available to the agent. Morton thinks that Simpson's paradox shows that deviant but rational belief based on counterfactually unreliable evidence is theoretically possible. On the contrary, I will argue that Morton has not shown that the phenomenon is possible, much less that it is rational.
My suggestion is that Morton's argument is based on an equivocation between causal and news value evidential relevance. If we remove the equivocation, there is no problem with counterfactual projection or unreliability. Considered separately, evidence in the news value sense and evidence in the causal sense are both forms of evidence equipped with counterfactually robust evidential force.

In this chapter, I will examine Morton's argument closely, explaining how the equivocation arises and why removing it resolves the problem of counterfactual projection. But I note at the outset of the discussion that there is a prima facie reason to think that Morton's conclusion is untenable. According to the usual probabilistic construal of evidence, e is evidence for h just in case e incrementally raises the probability of h. A crucial aspect of the incremental conception of qualitative evidence is that it rules out the possibility that h could in fact be more likely given the falsehood of e than given the truth of e. Thus if e is evidence for h, it cannot be the case that our rational degree of confidence in h is even higher given the complement of e than it is given e itself. This highly reasonable constraint on the relation of evidence arises from the fact that conditional probabilities are understood as probabilities conditional on an indicative supposition (see Joyce [36]). But we know that probabilities conditional on an indicative supposition are not the same as probabilities conditional on a counterfactual or subjunctive supposition. So suppose that we ask the following: just as it is incompatible with the idea that e is evidence for h that the probability of h is higher if e is false than it is if e is true, is it also incompatible with the idea that e is evidence for h that the probability of h if e were false would be higher than it is if e were true?
Intuition strongly suggests an affirmative answer. If we construe a probabilistic correlation as an evidential relation, then it seems that it ought to project counterfactually. If, however strong the actual probabilistic effect of e on h is, we think that h would nevertheless be more credible on the supposition that e were false, it hardly seems reasonable to claim that e provides evidence for h. Similarly for e providing evidence against h; if we took it that h would be less credible on the supposition that e were false, the claim that e disconfirms h seems to be irreparably impugned. The oddity of such claims of counterfactual variance, and the difference that it marks with the indicative conception of comparative probabilities given a pair of suppositions, indicates that there is something amiss in Morton's defence of deviantly unreliable but “rational” evidence and belief.

Counterfactual conditions on the force of evidence have been considered (and sometimes rejected) in the literature.⁶¹ As I noted above, the idea that the relation e is evidence for h (in any sense of evidence) is compatible with increased probability if e were false, but incompatible with increased probability given e's falsehood, seems to me to be inherently implausible. If it would be more likely that h obtains if e were false than if e were true, then it seems incorrect to judge that e is evidence supporting h in any ordinary sense. Fortunately, I think we can avoid this violation of intuition.

The distinction between correlations which project counterfactually and correlations which do not is not merely an adventitious idea in Morton's analysis of the phenomenon. It is at the heart of his analysis of Simpson's paradox. Morton thinks that in every relevant Simpson's paradox, there is a set of correlations which projects and a set which does not, and he thinks that the peculiarity of the paradox arises from the fact that our evidential concepts are somewhat unfortunately grounded in associations that do not project.
Moreover, Morton thinks that the presence of counterfactually projecting and non-projecting correlations ties Simpson's paradox to a host of famous paradoxes and puzzles from decision theory, such as Newcomb's paradox.

Newcomb's paradox was introduced to philosophy by Nozick [59]. It constitutes an important hallmark in the debate between evidential decision theory and causal decision theory. In its original form, evidential decision theory enjoins a rational agent to maximize conditional expected utility [12, 34]. Causal decision theory prescribes choosing the act with the highest causal expected utility [22, 36, 47, 75]. In many decision theoretical contexts, the prescriptions to maximize conditional or causal expected utility will agree on the uniquely rational act to perform. But in Newcomb's problem, causal and conditional (news value) expected utility diverge, hence (unreconstructed) evidential and causal decision theories disagree about which act it is rational to perform. In Newcomb's problem, however, intuition agrees with the verdicts of causal decision theory. Later modifications to the framework of evidential theory have sought to defuse the problem by appealing to ratificationism, the doctrine which enjoins the agent to calculate expected utilities on the supposition that he is going to perform a given act, that is, for the person he will be once he has chosen [12, 34].

⁶¹ See for instance Williamson [86].
To date, it is a moot point in decision theory whether supplementing standard evidential theory by the condition that the agent should choose only ratifiable acts (acts whose expected evidential utility outstrips the expected evidential utility of any alternative act on the supposition that the agent will perform it) succeeds in generating a justification for the intuitively correct act in Newcomb's problem [38, 39].

If Morton is correct in his conjecture that Simpson's paradox has deep structural affinities with Newcomb's paradox, then we should be able to generate an analogous epistemic version of the latter, on which one set of correlations reliably projects, and the other does not. In later parts of this chapter, I investigate this conjecture more closely, arguing that although a similar phenomenon seems to arise for Newcomb-like correlations, the puzzle is easily resolved if, as I have suggested, we distinguish a causal reading from a news value reading of the associations. I conclude that there is no deep problem about counterfactual comparative probabilities either in Simpson's paradox or in Newcomb's paradox, as long as we are willing to grant the distinctions for which I am arguing here.

5.2 Morton's account

Consider the following scenario from Morton [57]. An agent has submitted a paper to an interdisciplinary journal which accepts submissions from a wide range of disciplines, including philosophy. What is known about the process
It is also known that the probability ofa given paper being eventually accepted is not independent of the label; ineach department, being labelled “philosophy” is positively correlated withacceptance.We may assume that relevant frequency data describing this process isgiven in Table 5.1. It can be easily checked that the set of data is Simpson’sparadoxical. Although tending to reject the vast majority of manuscripts,each team of referees tends to reject a smaller proportion of papers labelled“philosophy” than papers receiving the “alternative” label. But overall,being labelled “philosophy” is positively correlated with rejection, despitethe fact that the correlation is reversed within each team of referees. Theexistence of the paradox is explained in probabilistic terms by two fac-tors. First, the Humanities team tends to make use of the philosophy labelmore frequently than the Social Sciences team; pr(X = philosophy ∣ Z =Humanities) = 0.1 > pr(X = philosophy ∣ Z = SocialSciences) = 0.01. Sec-ond, the Humanities team tends to reject more papers in general than the So-cial Sciences team, thus pr(Y = rejection ∣ Z =Humanities) = 0.9 > pr(Y =rejection ∣ Z = SocialSciences) = 0.8. These two conditions generate thereversal of inequality equations pr(Y = rejection ∣ Z = Humanities,X =philosophy) = 0.88 < pr(Y = rejection ∣ Z =Humanities,X = alternative) =0.9; pr(Y = rejection ∣ Z = SocialSciences,X = philosophy) = 0.7 <pr(Y = rejection ∣ Z = SocialSciences,X = alternative) = 0.8; whilstpr(Y = rejection ∣ X = philosophy) = 0.86 > pr(Y = rejection ∣ X =107alternative) = 0.85.Assume that at some point after the submission event but before helearns whether the paper has been accepted, the agent learns by means ofa private indiscretion that his paper has been labelled philosophy. 
This isclearly relevant information of the form “X = x” which can shape inferencesabout the probable values of Z and Y in light of the information received.Specifically, the agent wants to know how his evidence bears on the possibil-ity that his paper will be accepted. But how exactly should the evidentialreasoning proceed? On the usual understanding of evidence, supportingevidence comparatively increases probabilities, and disconfirming evidencecomparatively decreases probabilities, relative to our background knowledge.But the probability that a paper is accepted given that it is labelled philos-ophy is lower than the average or prior probability that a paper is accepted.Thus on the usual construal of evidence, the agent ought to regard the newinformation as marginally inauspicious, since it disconfirms the outcome inwhich he is invested. This of course is the verdict delivered by the usual prob-abilistic construal of qualitative incremental evidence, and Morton seems toagree with it. But he also thinks that the correlation which underlies theconventional verdict fails to project counterfactually, and hence fails to be areliable basis for belief, deemed by Morton to be deviant precisely becauseit rest on a correlation which fails to project. As a rational agent, “you” are:inclined to believe that your paper will be rejected. [...] It [thisbelief] is based on definite statistics, and not plainly irrational.But there is something deviant about it. For the fact that the‘philosophy’ label is correlated in both humanities and social sci-ences with a greater chance of acceptance. So in that way learn-ing its labelling is good news. There are two ways to express thedeviance of your belief. First, the reasoning on which it is based,though not fallacious, is not a reliable source of true belief. It isnot the case that had your paper not been labelled ‘philosophy’, itwould have been less likely to be rejected. 
In the nearest worlds in which it is not labelled ‘philosophy’ it is still with whichever referees actually have it, and ‘philosophy’ labelling tends to be associated with a judgment of quality which goes with acceptance [emphasis mine]. Second, the belief is unstable in relation to future evidence. You may learn tomorrow that your paper is being considered by the humanities referees, or you may learn that it is in the hands of the social scientists [...] Then if you learn that the humanities referees have your paper you will find that it has been labelled ‘philosophy’ comforting, a suggestion of acceptance. And so will you if you learn that the social scientists have it. So the grounds for your earlier belief that your paper will be rejected are going to be undermined either way the future information comes in, as you can now know. (57)

According to Morton, what seems to be inauspicious information in actual fact concerning Y would have been even more inauspicious had the information about X been false. This is essentially because (as Morton argues) the positive correlation between the actual information describing X and the undesirable state of Y both (a) fully determines the force of evidence in actual fact, and (b) fails to counterfactually project, so that auspiciousness under counterfactual assumptions about the value of X is judged by a different standard to the one applicable to actual auspiciousness of information about X.

5.2.1 Causal structure of Morton's example

As we know, from a purely probabilistic point of view there are two correlations to consider: (a) a partition-based correlation between label and outcome, and (b) an aggregate correlation between label and outcome. These have different polarities. The former correlates philosophy with rejection negatively. The latter correlates them positively. Morton's argument challenges us to think about how these probabilistic relationships fare under counterfactual supposition.
This involves trying to envisage how counterfactual probabilities regarding the outcome of the process might change. But the very idea that counterfactual probabilities can change at all under supposition about label imposes a causal constraint on the structure of the problem.

Consider for the moment just the correlation between label and outcome within each department. I'm not sure that we can say whether or not counterfactual probabilities under a change-of-label supposition are at all subject to variation without making specific substantive assumptions about the nature and causal significance of the correlation. For if the correlation is merely a purely probabilistic correlation and does not represent a relation of causal relevance, then counterfactual probabilities remain invariant, and the argument does not work. It is only if the correlation has causal significance that we may take it that the probabilities of the outcome would have been different had there been an alternative presentation of the evidence.

In more detail, to get the argument to work, we have to assume that label is a causal factor in its own right, not just correlated with outcome via an implied common cause. If the label is merely correlated with outcome, we do not obtain a variation of counterfactual probability. For example, suppose that the probabilistic relevance relation connecting label to outcome is mediated by an unmeasured latent variable reflecting intrinsic academic quality. We may assume that the philosophy label is indicative of slightly higher quality in the entire cohort of papers; and furthermore that high quality generally promotes acceptance, so an increment in quality will tend to produce a positive increment in chances of acceptance. It must be emphasized, however, that from a causative point of view, it is quality that has the power to affect the outcome, not label.
Since label is merely a reliable indicator of quality, its probabilistic effect on the outcome variable is entirely screened off by knowledge of quality. In such a case we would not judge that had the paper not been labelled philosophy, it would have had a different probability of acceptance. It would have had exactly the same probability as it does in actual fact, precisely because the factor which affects probability (quality) is invariant to change in auspiciousness in the signal of quality (label).

Fortunately, this is not the only way in which we may understand the concept of quality. We may distinguish between quality, as a reflection of intrinsic or inherent academic merit, and the perception of quality, as a separate factor in the analysis with the potential to contribute an independent causal influence on outcome. Suppose that the assignment of a label is capable of generating a wholly unsubstantiated expectation of merit or demerit, creating an environment in which papers labelled philosophy tend to be accepted at higher than normal rates not necessarily owing to intrinsic academic value, but because referees who see the label tend to go easy on the paper for a host of completely irrelevant reasons, such as being overly impressed by the label, or having partisan reasons for promoting the field. In this picture, the label is an active causal factor in the shaping of the outcome, alongside, we may suppose, the usual exogenous factors (unmeasured in the study) that generally tend to focus on intrinsic properties of the work, such as intellectual rigour. That is, if the causal system represents the distinction between inherent quality and perception of quality, with perceptions of quality driven by assignment of label, then it is perfectly legitimate to reflect that counterfactual probabilities will not remain invariant if we suppose that the evidence is presented under an alternative guise.
We may very well reason that if the paper had received an alternative label, the probability that it would be accepted would not remain invariant, by invoking the degree to which altering the label makes a difference in a causal sense to the outcome.

But then does Morton's argument really work the way it is supposed to? If we make the correct causal assumptions, we do indeed get counterfactual probabilities under the supposition of an alternative presentation of evidence to vary, but do they vary in the right way? It seems to me that if we (a) agree that the correct causal picture must represent label as causally relevant to outcome, and not merely as a correlated factor, and (b) agree to reason both actually and counterfactually in accordance with the positive correlation between philosophy label and acceptance, then we have to say that (c) had the paper received the alternative label, it would have been less likely to be accepted, rather than more likely. That is, we have to say that contrary to Morton's claim, the specific positive correlation between label and outcome does project counterfactually. This point is taken up in more detail in the next section.

5.2.2 A single evidential relation, or two?

In my view, the problem with Morton's argument is that it is based on an equivocation between different senses of evidence. That is, it is based on the conflation of two separate evidential relations, one tracking causal probabilities, and the other tracking ordinary conditional probabilities. Morton articulates the view that there is a single evidential relation based on the overall association between X and Y. But he also thinks that when we switch to counterfactual probabilities, the evidential relation loses its point, because the counterfactual supposition forces the evaluation of evidential relevance to take up the point of view of the partition-based correlation. From this arises the phenomenon of counterfactual deviance.
Thus, I understand Morton to be putting forward the following view:

(M) Learning the news that X = philosophy is evidence in favour of Y = rejection. Yet had it instead been the case that X = alternative, that would have been evidence in favour of Y = rejection. It follows that the evidential relation which obtains in actual fact between X and Y does not hold up under counterfactual projection.

But there are two different approaches here, and I think that the argument mixes them up. We can agree with Morton that an evidential relation must be based on one or the other correlation. That means that the force of evidence can be understood either as (E) or (C) below. But we should resist the temptation to run them together, as they appear to be in (M):

(E) Learning the information that X = philosophy is evidence in favour of Y = rejection. But if the agent had learned instead that X = alternative, that would have been evidence against Y = rejection. The evidential relation (in one sense of evidence) does hold up under counterfactual projection. What underlies this evidential relation is the negative correlation between the philosophy label and acceptance, which is rooted in simple news value conditional probabilities.

(C) Learning the information that X = philosophy is evidence against Y = rejection. But if the agent had learned instead that X = alternative, that would have been evidence for Y = rejection. The evidential relation (in the causal sense of evidence) does in fact hold up under counterfactual projection. What underlies this evidential relation is the positive correlation between the philosophy label and acceptance, which is rooted in the causal probability of acceptance given the label. Taking all causal knowledge into account, we obtain a second, distinct sense of evidential support, which does counterfactually project.

We should view (E) and (C) as referring to different aspects of the force of evidence. We should not run (E) and (C) together by equivocating between them.
(M) seems to do just that. Therefore, we should reject (M).

The conclusion we ought to draw is that rather than viewing this type of scenario as a phenomenon in which the evidential relation fails to project counterfactually, we ought to view it in terms of two kinds of evidence, causal and news value, both of which counterfactually project. Let us take each of these in turn. In more detail, what we want to say is this:

1. In the causal sense of evidence, if the agent had different evidence, his probabilities for acceptance would be lower than they actually are. But this is exactly as it ought to be, since in the causal sense of evidence, a philosophy label is evidence for acceptance, so we might indeed expect counterfactual probabilities to be lower if the evidence were false. Suppose we focus on the correlation between label and outcome relative to the partition by department. This correlation underwrites the causal sense in which information about X qua cause is information about Y qua effect. That is to say, taking all causal knowledge into account, we ought to say that the event X = philosophy is evidence in support of the event Y = acceptance. Thus if we average over department data with the un-updated probabilities for each department, we obtain the causal expected probability for Y = acceptance in light of X = philosophy. This is formally tantamount to conditioning Y on an intervention on X and obtaining the requisite causal probability. Moreover, the correlation underwriting the causal sense of evidence projects counterfactually, because comparing counterfactual probabilities under different values for X amounts to comparing the degree to which different label assignments are causally efficacious with respect to outcome; and we know that a philosophy label marginally promotes acceptance, whilst the alternative label marginally inhibits it.
There is a perfect fit between direction of evidential force in the causal sense and counterfactual probabilities on the falsehood of supposition concerning evidence.

2. In the news value sense of evidence, if the agent had different evidence, his probabilities for acceptance would be higher than they actually are. This too is how things ought to be, since in the news value sense of evidence, a philosophy label is evidence against acceptance, so it is fully legitimate to expect that if the evidence had been false, the probabilities of acceptance would be higher than they actually are. There is, as we have remarked, a negative correlation between X = philosophy and Y = acceptance across the data unstratified by department. We may take it that this correlation underwrites the news value sense in which the philosophy label is evidence against acceptance. When we obtain this evidence, we update the probabilities for each department, unlike in the case in which we reckon with the force of evidence in a causal sense. Since the Humanities favour the philosophy label compared with the Social Sciences, the Humanities emerges as a much more plausible possibility in light of the agent's evidence. But the Humanities is also much more exacting in its standards and tends to turn away a higher proportion of papers, regardless of label. So the overall news value of the evidence received is to push down the chances of acceptance even lower than they actually are. Moreover, the correlation tracked by the news value force of evidence projects counterfactually. Comparing counterfactual probabilities under different assignments for X amounts to comparing the degree to which the actual and alternative evidence provide information for department and hence implicitly also about acceptance. But the non-philosophy label is a much more auspicious indicator of department, hence if the agent had the alternative evidence, he ought to be more confident than he actually is that his paper will be accepted.
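The contrast between the two computations can be put in miniature numerical form. The counts below are purely hypothetical placeholders invented for illustration (they are not taken from any table in this dissertation): philosophy-labelled papers do better within each department yet worse overall, because they are concentrated in the more demanding Humanities. Averaging the within-department rates by the un-updated department probabilities gives the causal (interventionist) probability; straightforward conditioning gives the news value probability:

```python
# Hypothetical journal-submission counts exhibiting Simpson's paradox.
# counts[department][label] = (accepted, total submissions). Invented numbers.
counts = {
    "humanities":      {"philosophy": (16, 80), "alternative": (3, 20)},
    "social_sciences": {"philosophy": (12, 20), "alternative": (44, 80)},
}

def p_accept(dept, label):
    acc, tot = counts[dept][label]
    return acc / tot

def p_conditional(label):
    # News value sense: ordinary conditioning, P(accept | label).
    # Learning the label shifts the probabilities over the departments.
    acc = sum(counts[d][label][0] for d in counts)
    tot = sum(counts[d][label][1] for d in counts)
    return acc / tot

def p_do(label):
    # Causal sense: P(accept | do(label)) -- average the within-department
    # rates using the UN-updated department probabilities P(dept).
    grand_total = sum(t for d in counts for _, t in counts[d].values())
    result = 0.0
    for d in counts:
        p_dept = sum(t for _, t in counts[d].values()) / grand_total
        result += p_dept * p_accept(d, label)
    return result

# Within each department the philosophy label does better...
assert all(p_accept(d, "philosophy") > p_accept(d, "alternative") for d in counts)
# ...yet conditioning makes it evidence against acceptance (about 0.28 vs 0.47),
assert p_conditional("philosophy") < p_conditional("alternative")
# ...whilst intervening makes it evidence for acceptance (about 0.40 vs 0.35).
assert p_do("philosophy") > p_do("alternative")
```

The two functions differ only in how the department partition is weighted: by P(dept) in the causal computation, by P(dept | label) in the news value computation, which is exactly the distinction drawn in the text.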
In the news value sense, there is a perfect fit as well between the direction of evidential force and counterfactual probabilities.

I noted above that Morton approaches the force of evidence in Simpson's paradox from the point of view of good news versus bad news as considered by a rational agent with a practical stake in the outcome. Morton's terminology of good and bad news should not be confused with the sense in which I have described the force of evidence as having news value. In fact, Morton's terminology is equivocal. The simple question asked by Morton is: would the agent regard the information as auspicious or inauspicious, given his practical interest in the outcome of the causal process? But the simple question is deceptively simple and very possibly misleading too. It asks us to take at face value the following reasoning: The information X = philosophy is bad news. The information X = alternative would have been even worse news. Therefore the news is not as bad as it could have been, which means that the agent can take some comfort from hypothetical projections of evidential force, in defiance of the actual force of evidence.

But there is an equivocation here over what it means to be good news. In one sense X = philosophy is bad news because it gives information about Z. More precisely, it supports the proposition that Z = Humanities. And we know from the probabilistic data that standards in the Humanities must be tougher than in the other department, since they tend to reject more papers no matter what the value of X is. If we maintain this perspective on evidence, then we ought to say that X = alternative is good news for the agent for the analogous reason. It gives information that supports Z = SocialSciences, and we know from the conditional probabilities involved that, comparatively speaking, standards in the Social Sciences are more lax. Again we round this off over possible values of X. Standards are more lax in the Social Sciences no matter what the value of X is.
It follows that the alternative evidence would have been much better news in this sense of news. But now consider the causal sense of news. We know that the assignment of the label does have some causal influence on the outcome, and that this causal effect is positive with respect to acceptance. Thus information about X qua cause of Y gives good news about Y qua effect. It tells us that one crucial variable in the causal system has taken a value which causally promotes the outcome in which the rational agent has a stake. In the causal sense the information X = alternative would have been far worse news. It is only by equivocating on the causal and news value senses of news that we get to the counterintuitive and mistaken idea that one can get comfort from counterfactual evidence if one takes one's actual evidence to be dire news. Indeed, one can take one's actual evidence to be dire news, but only in the news value sense. And then the alternative news would have been better news, so there is no room for counterfactual comfort. Or one can take one's actual evidence to be good news, as far as causal value is concerned. And then the alternative news would have been bad news, so there is no room for wishing it had been otherwise. What one can't do, and what Morton's argument requires one to do, is adopt both positions on the evidence simultaneously.

5.2.3 Backtracking and non-backtracking counterfactual conditionals in relation to evidence

The distinction between a backtracking and a non-backtracking interpretation of the counterfactual conditional is crucial to causal reasoning. In Lewis's counterfactual analysis of causation, for instance, the distinction between backtracking and non-backtracking counterfactuals corresponds to the distinction between spurious and non-spurious causal dependence [44, 45].
A non-backtracking counterfactual in Lewis's sense is one in which the past is held fixed almost up until the time at which the antecedent is supposed to obtain counter-to-fact. In a more general sense, a non-backtracking counterfactual requires us to hold fixed the causes of an event whilst allowing the effect itself to vary as needed.

Interpreting the relevant conditionals in a non-backtracking way guards against the following type of inference. Suppose that Z is a common cause of both X and Y. Then we might reason as follows: had X not occurred, Z would not have occurred (backtracking inference), hence Y would not have occurred. So, by transitivity, if X had not occurred, Y would not have occurred, so Y counterfactually depends on X. Such spurious non-causal counterfactual dependence is blocked on Lewis's theory by requiring that the past leading up to the occurrence of X, which typically includes the cause Z, should be kept fixed. Alternative theories of causation in terms of counterfactual dependence achieve this effect through the way in which the notion of an intervention is characterized. On the interventionist view of Pearl or Woodward, spurious backtracking inferences are ruled out, insofar as counterfactuals of the form If X shifts in value under a local intervention, Z shifts in value are always false when Z is a cause of X [62, 88]. An intervention is a modification which changes the relation between X and its causes, but leaves everything else unchanged, so the distribution over Z is unchanged under an intervention on X.

What the present analysis of the force of evidence in Simpson's paradoxical contexts reveals is that the distinction between backtracking and non-backtracking counterfactual dependence is crucial to evidential reasoning as well. When we reason about the force of evidence in the news value sense, we use backtracking reasoning.
For example, we reason that if John had not smoked, we would have had evidence in the news value sense to indicate that he lives in the city, which for all we know is in fact false, but could be true in the alternative world. In reasoning in this way, we do not hold fixed either causal or temporal order leading up to John's decision whether to smoke. To reflect the flexibility of backtracking assumptions, we allow the shifting balance of probabilities to shape our view about probable location, as opposed to keeping actual location fixed, whatever it may be. In the actual world, we have evidence that makes it more likely that John lives in the country rather than the city. But when we reason in the news value sense, we allow that the balance of probabilities shifts to make city more probable than country, had we received the alternative information about John's smoking habits. By contrast, holding fixed the subject's location in the actual world enters the picture when we reason about evidential relevance causally. Probable location in view of the information received plays no role in assessing evidential force in this second, non-backtracking sense.

5.3 Newcomb's paradox

If Morton is correct about Simpson's paradox and Newcomb's paradox being cognate phenomena, we should be able to generate an analogous "evidential" version of Newcomb's paradox. In the best known version of Newcomb's paradox, an agent is presented with a choice between an act (one-boxing) which is highly correlated with a desirable state of affairs, and an act (two-boxing) which is highly correlated with an undesirable state of affairs. Neither act has the capacity to causally promote the desirable or undesirable state of affairs.
On the other hand, just as with Simpson's paradox, there is a second correlation between choosing the act which betokens the undesirable state of affairs (two-boxing), and getting a very small good as a sure thing, regardless of which state of affairs obtains; whereas the act which betokens the good state of affairs (one-boxing) is correlated with missing out on the extra small bit of good as a sure thing.

For the sake of concreteness, here is a well-known version of Newcomb's paradox.62

62 Newcomb's paradox is remarkably ubiquitous. Although the best known version borders on science fiction, it has been argued that the famous Prisoner's Dilemma can be considered to be a form of Newcomb's paradox [38, 46]. Skyrms [75] gives further convincing and far more mundane versions of the paradox, including the so-called voter's paradox. See also Lewis [47].

An agent must choose between taking a single opaque box, and taking the opaque box together with a transparent box whose contents he can see for himself. The transparent box contains the small good of 100 dollars. The opaque box may or may not contain 1,000,000 dollars. Whether it does contain this amount of money depends on a highly expert prediction made previously about the agent's choice. If it was predicted that the agent would one-box, the money was placed inside the opaque box; if it was predicted that the agent would walk away with both boxes, no money was placed inside the opaque box. It is understood that the agent's decision is highly correlated with the prediction. If the agent takes the one opaque box, he may take this as a reliable signal that the money is in the box; if he
But whilst his action is highly correlated withthe state of affairs which determines how much money he ultimately getsby means of his action, it is also understood in this problem that the agentcannot cause the prediction or change the state of affairs, whichever it maybe. At the time of his choice, the money either is, or isn’t, in the opaque box.This second correlation between action and outcome is a form of reasoningby dominance: no matter what the agent chooses, the two-boxing act is cor-related in the second sense with getting the extra good, whilst one-boxingmisses out.If Morton is right about Simpson’s paradox, then we should be able toreason about the pattern of correlations in Newcomb’s paradox as follows:Suppose that agent takes one box, and gets the million dollars. One-boxingis positively correlated with getting the million dollars. If this correlationis apt for counterfactual projection, we should expect that if the agent hadtaken two boxes, he would have been worse off, not better off, than he is inactual fact. And yet had the agent taken both boxes, he would in fact havebeen better off, not worse off, than he actually is, since he would have takenboth the actual million dollars, and the extra hundred on top. Similarly,suppose that he takes the two boxes, and does not get the million dollars,but only the contents of the hundred dollar box. By the same principle,there is a negative correlation between two-boxing and getting the milliondollars. If this correlation projects counterfactually, we should expect thathad the agent taken just one box, he would have been better off than heactually is having taken both. But had he taken one box, he would have beenpoorer than he actually is, having secured no money at all. 
It follows that the correlation between decision and gains, specifically between deciding to one-box, and making a million (alternatively between deciding to two-box, and not making a million, or making just one hundred dollars) does not counterfactually project.

But I think we have to say that this argument suffers from the same deficiencies as the analogous argument about the journal submission example. The mistake in the argument is the equivocation between reasoning about the evidence causally, and reasoning about it in terms of pure news value. There are two fundamentally different types of correlation in this version of Newcomb's problem. There is a causal correlation between the act and the outcome. In this sense, the act of two-boxing is correlated with getting a sure thing gain of one hundred dollars. This correlation holds up counterfactually as well. Had the agent not taken the second box, his act would have been correlated with getting no money, so, as we might reasonably expect, he would have been marginally worse off, not better off. On the other hand, there is a news value correlation between the act and the outcome, which is mediated by the news value that the act provides for the antecedent state of affairs, which by contrast is absent from the causal evaluation. In light of information about the act, the agent updates his probabilities about the partition, which then feeds into his evaluation of expected utility. Thus if he learns that he has one-boxed, he should consider it good news in the news value sense, because it raises the expectation that the money is in the opaque box, compared with learning counterfactually that he had two-boxed.

5.4 Conclusion

I have argued in the previous chapter that understanding the force of evidence in Simpson's paradox requires us to draw a distinction between causal and news value relations of evidential relevance.
In this chapter I have sought to put the distinction to use by arguing that resorting to causal and news value evidence defuses a problem that Morton [57] raises with respect to the possibility of projecting one's conclusions about actual evidence counterfactually in Simpson's paradoxical evidential scenarios. However, the discussion is not complete. In the next chapter, I will address the remaining part of the problem, which is the problem of belief in relation to future evidence.

Chapter 6

Simpson's Paradox and the Sure Thing Principle

6.1 Introduction

This chapter explores the implications posed by Simpson's paradox for a classic principle of decision theory, the Sure Thing Principle. Informally, STP states that if a rational agent prefers an option f to a different option g both when he knows that an event E obtains, and when he knows E does not, then he prefers f to g unconditionally, in ignorance whether E holds.

Simpson's paradox has been charged with raising a number of serious problems for STP. The first is a problem about the validity of STP. In his seminal paper Blyth [8] argued that when the data is Simpson's paradoxical, STP recommends the wrong act and is therefore "false" or "not applicable" relative to a set of gambles described by Blyth by way of exploiting SP data.63 The second problem may be dubbed the problem of partition invariance.64 Whilst Blyth (reportedly with Savage's endorsement) argued that STP sometimes appears to make the wrong recommendation, Gibbard and Harper [22] famously argued that STP makes incompatible recommendations, depending on how we select the partition. Gibbard and Harper apparently did not think that their argument invalidates STP. To deal with their own objection, they proposed that STP should be replaced by two different forms of sure thing principle reasoning, an epistemic STP and a causal STP. But ultimately, they did think that causal STP recommends the correct act, and by extension, that the incompatible answer given by epistemic STP is wrong as applied to the individual decision problem. So they must have thought that their counterexample involving incompatible acts also implies the "non-applicability" of STP in a general sense.

63 Blyth [8], pages 365-66. The argument given by Blyth is traced and elaborated in Malinas [51], pages 270-75, who rejects the problem as "spurious". See also Malinas [50].

64 An early but valuable work dealing with the implications of partition sensitivity for Savage's decision theory is Lewis [47]. The problem of lack of partition-invariance for Savage's theory is now notorious. For a recent discussion see Arntzenius [3] and the important treatise on causal decision theory by Joyce [36].

However, the two objections are not the same. There is an important difference between the charge of invalidity and the charge of partition-sensitivity levied against STP. Gibbard and Harper do not deny that if we grant the premises of the epistemic STP (in a scenario in which the epistemic version clashes with the causal STP) we must also grant the conclusion of the epistemic STP reasoning. Blyth by contrast does deny that the conclusion follows from the premises. In this chapter, I will examine and respond to both objections. I go on to reject Blyth's objection of invalidity (I agree with the view in Malinas [51] that the problem is largely "spurious") whilst granting and indeed expanding the charge of partition-sensitivity.
To make sure that the terms are correctly set out, I will therefore define them as follows:

Invalidity: The conclusion of an instance of STP reasoning does not follow from the premises.

Partition-sensitivity: Which act is recommended by STP depends on how the partition is selected.

The distinction between these two different ways in which STP reasoning can go wrong means that one can have partition sensitivity without having invalidity (or conversely). As I see it, the very reason why partition-sensitivity is so challenging is that one cannot reply that the reasoning itself has misfired. Indeed there is nothing in the Gibbard and Harper style of counterexample that shows that when one derives the conclusion of STP from its premises in either the epistemic or the causal sense one has reasoned incorrectly. On the other hand, one needs to have a principled way to reject the misleading partitions, whilst upholding those which give the right one. As I will go on to show, the problem is far deeper than Gibbard and Harper seem to have suspected: causal partitions can make different validly derivable recommendations, relative to each other. To resolve this problem, I will suggest that one should take the "meet" of the causal partitions, but at the same time, it is obvious that the process of regenerating dissent by repartitioning the data by bringing in more and more causal factors could in principle go on indefinitely. There is in my view no solution that will remove the phenomenon from the conceptual landscape of decision theory, though better causal knowledge should make the problem more tractable.

The Sure Thing Principle is a cornerstone of the theory of rational preference developed by L.J. Savage in his classic 1954 "Foundations of Statistics". The axiomatic system to which STP belongs is at the very heart of Savage's attempt to justify subjective expected utility maximization as the foundation of rational decision making.
To show that the rational choice-worthy act is the act which maximizes expected utility, Savage gave an axiomatic characterization of preference among actions or options which can be expected to produce desirable consequences. He then showed in his famous representation theorem that any ranking of preferences which respects his axioms is consistent with the hypothesis that the agent is ranking acts according to expected utility. The representation theorem amounts to a proof that any agent whose preference ranking satisfies the axiomatic constraints (including STP) imposed by Savage is acting as if maximizing expected utility.

The centrality of the Sure Thing Principle, at the core of the axiomatization of rational preference, to the justification of Savage's decision theory has led one prominent contemporary decision theorist to write that Savage's case for expected utility maximization virtually rests on the plausibility of these formal axioms as requirements of practical rationality.65 Yet to date STP remains a contentious axiom. Various challenges have been raised, including those arising from the famous Allais paradox and the Ellsberg paradox. Opinion as to whether STP is a legitimate constraint on preference remains split. But there are at least two specific challenges arising for STP in Simpson's paradoxical scenarios. In this chapter, I defend STP against both challenges. The first challenge is that STP is invalid, since apparent counterexamples to STP seem to arise when the probabilistic structure of the decision problem respects the Simpson paradoxical inequalities. I will argue that these counterexamples don't stand up to closer scrutiny. STP has nothing to worry about from them. More importantly, the second problem is that STP gives rise to apparently incompatible verdicts, insofar as different applications of STP to different partitions of the logical space yield different expected utility maximizing acts.
Although not insuperable, this problem is not easily remediable either. Ultimately however I suggest that the fact that STP is partition-sensitive is something that decision theory ought to accommodate.

The structure of this chapter is as follows. First, I will give some necessary (albeit abbreviated) background on STP to render my subsequent discussion intelligible. Next, I will introduce and develop the two challenges: (1) the challenge to the validity of STP arising from SP scenarios; (2) the challenge arising from partition sensitivity. In the following third section, I will respond to each objection in turn. Since I think that the second objection is far more formidable than the first, most of the substance of the discussion will be dedicated to getting the second problem right, and responding to it as far as possible; following which I conclude in section 4.

65 See Joyce [36].

6.1.1 Background to STP

The Sure Thing Principle is an axiomatic principle of the influential theory of rational preference developed by L.J. Savage. The theory of rational preference starts with the idea that a rational agent is capable of ordering the various options open to him. In order to count as rational, Savage suggests that the resulting ranking of preference among options must satisfy certain constraints. The Sure Thing Principle is one such axiomatic constraint on the notion of rational preference. STP states that if an agent prefers an option to another both conditional on knowledge that a contingency about the world obtains, and conditional on knowledge that it does not, then he ought to prefer the former to the latter unconditionally, in ignorance whether the event obtains.
For example, if you are committed to giving treatment to a patient both if you knew that their blood group were A, and if you knew it were AB, B, or O, then there appears to be no need to test for blood type. Informally:

(Sure Thing Principle) If you would definitely prefer g to f, either knowing that E obtained, or knowing that ¬E obtained, then you definitely prefer g to f.66

66 In Savage's theory, f and g can be acts or choices or strategies "of almost any sort"; in other words, the less said about their specific nature, the better it is for the theory employing them as useful abstractions.

Although perfectly intuitive, indeed in most contexts verging on platitudinous, postulating STP as an axiom of decision-making famously needed qualification. Savage himself noted that STP is inapplicable when it is within the agent's power to affect the probability distribution over E. Suppose that you are offered a cup of tea laced with arsenic. No one should be tempted into reasoning that either one will die after drinking it, or survive, and in each case it is better to have drunk some tea than to have abstained, hence it is better to drink the tea. Savage proposed to handle such problem cases by restricting the application of STP to partitions which are independent of the agent's range of acts. In a somewhat similar but more distant vein, it is pointed out by Aumann et al. [4] that the Sure Thing Principle is inapplicable when the decision maker's knowledge affects what the authors call the quality of the consequence. For example, suppose that whilst handling the spine of a book, you reflect that if you knew it was written by A,
For to enjoy a book requires knowing who the author is, although it maynot matter to you specifically whether it is A rather than B.Most of Savage’s commentators follow his lead in acknowledging that hisaxiom calls out for a modest refinement, in terms of explicitly limiting STPreasoning to decision-making problems requiring causally act-independentpartitions. However, it is possible to overstate the problem. For in thevicinity of the spurious reasoning from causally influenceable premises liesa non-spurious form of counter-argument with STP premises granted as itwere by an “intervention”. Suppose it is granted as a matter of fiat that theagent will die; then it seems reasonable to think that it is better for him tohave drunk the tea than not; and similarly if we “intervene” to make it thecase that he survives, then again it is better to have enjoyed the drink thanabstained. From which it follows that if the agent supposes the decision tobe out of his control, for example by imagining that he has already takenit, so it’s unalterably in the realm of the past tense, it is legitimate to thinkthat having settled the matter one way or another, it is better to have drunkthe tea than not. To be sure, such reasoning should never itself contributeto the formation of a decision, since it removes the point of deliberationfrom the agent’s grasp. But it doesn’t seem that to assess the minor goodthat drinking the tea contributes from a hypothetical-retrospective point ofevaluation in which the decision has been fixed immutably involves one incognitive error. Such reasoning is not invalid. Rather it is dangerous andself-undermining if taken as a guide to deliberation, in which the operativenotion ought to be that of a free act of the will as opposed to an interventionor a “given”. What STP requires of us is the ability to distinguish goodretrospective arguments from bad deliberative ones. 
If one should live or die regardless of the choice one is about to make, it is better that one should live or die having drunk the tea than not. But nothing follows from this that bears upon what one ought to do, since in deciding what one is to do, one acknowledges that future states are contingent consequences (and not constituents) of one’s actions.

6.2 SP challenges to STP

6.2.1 First challenge: validity

The connection between Simpson’s Paradox and the Sure Thing Principle was identified by Colin Blyth in his classic paper. The basic point is that contexts in which the Simpson paradoxical inequalities are satisfied appear to freely generate counterexamples to STP reasoning. That is to say, by means of Simpson’s paradox we appear to be able to generate scenarios in which the epistemic premises of STP are satisfied, and the conclusion is not.

Take the data in Table 6.1. It is supposed to represent information about the outcome of a fictitious medical trial, aimed at assessing the efficacy of a new drug compared with the standard drug given as treatment of a potentially lethal disease. The data sums up survival rates among patients from Chicago (C) and elsewhere (¬C).

Table 6.1: Treatment data, Chicago.

                   C               ¬C             Alive (%)
               Alive   Dead    Alive   Dead     C    ¬C   Overall
  Standard        50    950     5000   5000     5    50        46
  New           1000   9000       95      5    10    95        11
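As a quick sanity check on the table, the reversal can be verified directly from the counts. The following is a sketch, not part of the original text; the treatment and location labels are my own:

```python
# Counts from Table 6.1 as (alive, dead), per treatment and location.
data = {
    ("standard", "C"):  (50, 950),
    ("standard", "~C"): (5000, 5000),
    ("new", "C"):       (1000, 9000),
    ("new", "~C"):      (95, 5),
}

def survival_rate(treatment, location=None):
    """Fraction alive among patients given a treatment (optionally per location)."""
    cells = [v for (t, loc), v in data.items()
             if t == treatment and (location is None or loc == location)]
    alive = sum(a for a, d in cells)
    return alive / sum(a + d for a, d in cells)

# Within each location, the new drug looks better ...
assert survival_rate("new", "C") > survival_rate("standard", "C")    # 0.10 > 0.05
assert survival_rate("new", "~C") > survival_rate("standard", "~C")  # 0.95 > 0.50
# ... but overall the standard drug looks better: the Simpson reversal.
assert survival_rate("standard") > survival_rate("new")              # ~0.46 > ~0.11
```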
The data is clearly Simpson paradoxical, with the new drug outperforming the standard within each geographical location, and the standard drug outperforming the new overall.

Suppose (argues Blyth) that we define a pair of options f and g as follows:

f: draw patients at random from Table 6.1 until you encounter a patient who was administered the standard treatment, and bet x amount of money that the patient is alive.

g: draw patients at random from Table 6.1 until you encounter a patient who was administered the new treatment, and bet x amount of money that the patient is alive.

Observe that it seems that g dominates f, both under the assumption that one knows that the patient is from C, and under the assumption that one knows that the patient is from ¬C. That is:

(i) Given knowledge of C, g is preferable to f;
(ii) Given knowledge of ¬C, g is preferable to f.

But that means that applying the Sure Thing Principle to (i) and (ii) yields that:

(iii) g is preferable to f.

However, (iii) is clearly false. Knowing neither that C, nor that ¬C, it is f which is preferable to g, since f gives you about four times the probability of winning x amount of money compared with g, in a situation in which you are ignorant of the status of the patient with respect to the partitioning parameter. Therefore, the Sure Thing Principle is invalid.

Consider next a variation on this basic argument. Note that we can give the argument against STP from SP data an extra twist by making knowledge of the partitioning variable explicit in the way the options are defined. Suppose we redefine the options f and g into pairs of conditions which explicitly depend on knowledge of C, i.e. knowing which of {C, ¬C} obtains is built in to f and g. Let fC and gC be defined as follows:

fC: draw patients at random from Table 6.1 until you encounter a patient who was administered the standard treatment. It is now revealed whether the patient is from C or not.67 If from C, bet x amount of money that the patient is alive.
Otherwise repeat the draw until you get a patient from C.

67 Imagine that there is a public announcement to this effect, or that a third party steps in and gives you this information for free.

gC: draw patients at random from Table 6.1 until you encounter a patient who was administered the new treatment. It is now revealed whether the patient is from C or not. If from C, bet x amount of money that the patient is alive. Otherwise repeat the draw until you get a patient from C.

What is the structure of fC and gC? In essence, both are constructed so that the bets are made conditionally on the information that the patient selected is from C. Whichever option one chooses, one knows that x amount of money will be wagered on the event that the patient is alive only if the patient is from Chicago. The conditionality means that the information about patients not from Chicago is effectively ignored when one chooses between fC and gC. In short, fC and gC are defined relative to the sample restricted to C patients. Analogously, define f¬C and g¬C as follows:

f¬C: draw patients at random from Table 6.1 until you encounter a patient who was administered the standard treatment. It is now revealed whether the patient is from C or not. If from ¬C, bet x amount of money that the patient is alive. Otherwise repeat the draw until you get a patient from ¬C.

g¬C: draw patients at random from Table 6.1 until you encounter a patient who was administered the new treatment. It is now revealed whether the patient is from C or not. If from ¬C, bet x amount of money that the patient is alive. Otherwise repeat the draw until you get a patient from ¬C.

It is easy to see that the indexed g strategy dominates the indexed f strategy, considered in each index category:

(i-*) gC dominates fC.
(ii-*) g¬C dominates f¬C.

Does it follow that:

(iii-*) The conjunction of strategies gC and g¬C dominates the conjunction of strategies fC and f¬C?

On the face of it, no.
For (iii-*) is virtually equivalent to the false judgment that, in the previous model,

(iii) g dominates f.

To see this, suppose that a draw is made in accordance with gC and g¬C, until one encounters a patient treated with the new treatment. At the relevant moment in the draw it is revealed that C; then by the definition of g¬C the draw is repeated, and by the definition of gC, you are to bet x amount on the patient being alive. Alternatively, suppose that at the relevant moment in the draw it is revealed that ¬C; then by the definition of gC, the draw is repeated, and by the definition of g¬C, you are to bet x amount of money on the patient being alive. Apart from the fact that the sequence of draws appears to extend indefinitely, the problem is that you are effectively betting x amount of money that a patient treated with the new treatment, from either the C or the ¬C pool, will go on to recover, when you could be betting the same amount of money that a patient treated with the standard treatment is alive, which, as noted, gives you a fourfold increase in the probability of winning. In other words, (iii-*) is false and (iv-*) is true:

(iv-*) The conjunction of strategies fC and f¬C dominates the conjunction of strategies gC and g¬C.

So it seems that although the set of options {fC, f¬C} dominates {gC, g¬C}, each g type of choice dominates the correlate f type of choice, i.e. gC dominates fC and g¬C dominates f¬C. This concludes the rendition of Blyth’s argument, which I have given in a little more detail than it appears in his text.

6.2.2 Second challenge: partition sensitivity

Here is the second problem that Simpson’s paradox poses for the Sure Thing Principle. As I have explained, one ought to regard it as an independent problem, targeting not the validity but the applicability of the reasoning. The problem stems from the fact that applications of the Sure Thing Principle to Simpson’s paradoxical scenarios are sensitive to the background partition. This is evidenced e.g.
by the classic Gibbard and Harper example discussed below, in which the act that maximizes expected utility relative to one partition (the causal partition) fails to maximize it relative to another (the epistemic partition), and conversely. But partition sensitivity means that there is no unique answer to the question of which act it is that maximizes expected utility. Arguably, the fact that STP fails to recommend a unique act in the relevant type of SP scenario is a fatal flaw of the theory. Cases of decision instability aside,68 it seems to be a condition of adequacy of any decision theory that it should enable a decision maker to come to a final conclusion as to which act to perform. But SP cases are quite clearly not cases of decision instability. So in SP cases one does expect the process of deliberation to terminate conclusively and without disagreement.

In response to this problem, Gibbard and Harper have suggested that there are in fact two different Sure Thing principles, one premised on causal independence, and another premised on evidential or stochastic independence. In cases in which the deliverances of the causal and epistemic STP differ, Gibbard and Harper argue that the rational preference is given by the causal application of sure thing reasoning. According to Gibbard and Harper, in SP cases the process of practical deliberation does terminate conclusively after all, because the disagreement is resolved by ignoring one STP prescription in favour of the other.

68 The celebrated Death in Damascus story occurring in the same paper exemplifies the now notorious phenomenon of decision instability. Decision instability was redubbed by Jeffrey as unratifiability. An unratifiable act A is one which the agent would not endorse on the supposition that it is chosen “for the person one would be if one were to choose”. If an act A is ratifiable then there is no alternative act with a greater expected utility on the supposition that one were to perform A. See Jeffrey [33]. Decisions lacking a ratifiable act, such as the Death in Damascus scenario, tend to strike us as deficient; it isn’t easy to see how the process of deliberation can terminate, except by brute force, e.g. imposing a penalty for not reaching a decision, when the fact of supposing that a choice were reached generates evidence that one would have done (or would do) better with an alternative. Weirich [85] provides a useful perspective, defending causal decision theory from the threat of instability. A state of the art discussion is found in Joyce [39]. See also Pearl [63] for a fascinating statistical re-imagining of Death in Damascus.

The purpose of this section is to outline their argument. The intention of Gibbard and Harper was to present a scenario which combines stochastic independence and causal independence, showing that (act) stochastic dominance and causal dominance do not amount to the same thing. Unfortunately, due to an arithmetical error appearing in the original paper, and which later reprints have failed to correct, it is impossible to present the argument exactly as Gibbard and Harper presented it. At best, even correcting the arithmetic, their argument enables them to draw conclusions about weak dominance, whereas their goal seems to be the notion of preference ranking based on strong dominance. Below I substitute appropriate figures that make the argument work as intended, instead of as it exists in print.

As noted, the Sure Thing Principle recommends rationally preferring an option f, if f is preferred both conditional on knowing E, and supposing knowledge of ¬E, for a suitable background proposition E.
This may give rise to the naive expectation that the Sure Thing Principle recommends the same option f relative to any suitable partition of the logical space. But the naive expectation concerning STP is demonstrably false. In particular, it fails when the probabilistic information is Simpson’s paradoxical. The following example from Gibbard and Harper can be used to make the case.

(Reoboam and the Advisory Board) The newly ascended king Reoboam is deliberating whether he wants to reign severely (S) or leniently (L), in a kingdom in which kings of both kinds are sometimes deposed early due to popular uprising against them. As it happens, Reoboam would prefer reigning severely to reigning leniently, all else being equal. But he cherishes his crown, and would rather reign indifferently (that is, leniently or severely) than be deposed and not reign at all. Contemplating this problem, he thinks the possible outcomes (each having some degree of positive probability) are as follows. He could elect to reign severely, eventually encountering deposition (SD), or his severe reign might meet with no deposition (S¬D). Similarly, electing to be lenient might be followed by deposition anyway (LD), or by no deposition at all (L¬D). Assume that his utilities, respecting his total preference for any reign over no reign at all, as well as his penchant for severity, are as in Table 6.2. What should Reoboam do? The king’s total evidence is given in Table 6.3. The evidence is based on a sample of one hundred similar but long gone kings of the realm; the resulting historical data provides the basis for Reoboam’s subjective probabilities.

Table 6.2: Utilities Reoboam.

             Deposed   ¬Deposed
  Lenient        0         80
  Severe        10         90

Table 6.3: Reoboam data.

                Charismatic    Uncharismatic     Deposed (%)
                  D    ¬D        D    ¬D       Ch   Unch  Overall
  severe (S)     16    24        8     2       40     80       48
  lenient (L)     2     8       22    18       20     55       48

What is the rational act for Reoboam to choose given the total information available to him?
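Before turning to the advisors, the probabilistic structure of Table 6.3 can be verified directly from the counts. The following is a sketch, not part of the original text (the labels are mine); exact fractions avoid any rounding worries:

```python
from fractions import Fraction as F

# Counts from Table 6.3 as (deposed, not deposed), per rule and charisma type.
counts = {
    ("severe",  "charismatic"):   (16, 24),
    ("severe",  "uncharismatic"): (8, 2),
    ("lenient", "charismatic"):   (2, 8),
    ("lenient", "uncharismatic"): (22, 18),
}

def p_deposed(rule, charisma=None):
    """Probability of deposition for a rule, optionally within a charisma type."""
    cells = [v for (r, c), v in counts.items()
             if r == rule and (charisma is None or c == charisma)]
    deposed = sum(d for d, nd in cells)
    return F(deposed, sum(d + nd for d, nd in cells))

# Within each charisma type, severity goes with more depositions ...
assert p_deposed("severe", "charismatic") > p_deposed("lenient", "charismatic")      # 0.40 > 0.20
assert p_deposed("severe", "uncharismatic") > p_deposed("lenient", "uncharismatic")  # 0.80 > 0.55
# ... yet marginally, deposition is independent of the type of rule.
assert p_deposed("severe") == p_deposed("lenient") == F(12, 25)                      # both 0.48
```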
Assume that charisma is an innate property. Charisma, we may venture to say, is native and unalterable, so a king cannot alter his degree of charisma by acting in a certain way. So what does the probabilistic information tell us? Statistically, it shows that charismatic kings tend to be severe, whereas uncharismatic ones tend to act leniently. On the other hand, charismatic kings tend to get away with not being deposed, no matter which type of rule they implement, whereas uncharismatic ones appear to precipitate revolts leading to their deposition, irrespective of the nature of their rule. Thus the blandishments of an uncharismatic king don’t seem to avail much against popular discontent, whereas (perhaps surprisingly) the draconian severity of charismatic ones doesn’t foment discontent on the same level. Compare the probability that a severe charismatic king is deposed (0.4) with the probability that a lenient uncharismatic king is deposed (0.55). The latter is still higher, which tells us something about the effects of charisma on the politics of the realm. But the integration of these two facts across the data (the correlations between charisma type and rule type, and effect of rule type given charisma type) leads to the washing out of probabilistic significance. Overall, deposition is stochastically independent of type of rule, since severe kings and lenient kings have the same probability (0.48) of being deposed when we integrate across types of charisma.

To understand the structure of the problem, argue Gibbard and Harper, it is important to distinguish two types of independence. Charisma is causally independent of type of rule. And outcome of rule is stochastically independent of type of rule. Note that stochastic or probabilistic independence is a symmetrical relation. If A is probabilistically (in)dependent of B, then B is probabilistically (in)dependent of A. But causal independence is not. Considered as cause, charisma may well be causally relevant to type of rule.
But the type of rule a king implements during his reign has by assumption no causal bearing on whether or not the king is endowed with charisma.

Next, suppose we give the story a suitably dramatic flourish by inventing two types of advisors to the king. Eminence grise no 1 (Gris Sombre) bends the king’s ear with the following argument. According to Gris Sombre, there are two possibilities of interest: Reoboam is charismatic or he is uncharismatic. Each possibility is causally independent of what the king does, i.e. of the act space. On the assumption that he is innately charismatic, it is rational to prefer lenience to severity. Calculating expected utilities of severity and lenience, relative to each causally independent assumption, we get:

Expected utility of severity, if charismatic = u(SD) × pr(D ∣ S ∧ C) + u(S¬D) × pr(¬D ∣ S ∧ C) = 0.4 × 10 + 0.6 × 90 = 58

Expected utility of lenience, if charismatic = u(LD) × pr(D ∣ L ∧ C) + u(L¬D) × pr(¬D ∣ L ∧ C) = 0.2 × 0 + 0.8 × 80 = 64

Thus, if the king is charismatic (C):

Expected utility of severity < Expected utility of lenience.

On the other hand, if the king happens to be uncharismatic, his expected utilities once again line up in favour of lenience:

Expected utility of severity, if uncharismatic = u(SD) × pr(D ∣ S ∧ ¬C) + u(S¬D) × pr(¬D ∣ S ∧ ¬C) = 0.8 × 10 + 0.2 × 90 = 26

Expected utility of lenience, if uncharismatic = u(LD) × pr(D ∣ L ∧ ¬C) + u(L¬D) × pr(¬D ∣ L ∧ ¬C) = 0.55 × 0 + 0.45 × 80 = 36.

Thus, if the king is uncharismatic (¬C):

Expected utility of severity < Expected utility of lenience.

Finally, for the last step:

(STP applied to the causal partition) Since either C or ¬C must obtain, and in both cases lenience trumps severity, it must be the case that overall, not knowing which of C or ¬C obtains, lenience is preferable to severity.

If lenience is the act which maximizes utility relative to the causal partition, then the king should choose to reign leniently.
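Gris Sombre’s two calculations can be reproduced mechanically. The following is a sketch, not part of the original text; the probabilities are read off Table 6.3, the utilities off Table 6.2, and exact fractions avoid floating-point rounding:

```python
from fractions import Fraction as F

# Utilities from Table 6.2, as (if deposed, if not deposed) pairs.
u = {"severe": (10, 90), "lenient": (0, 80)}

def eu(rule, p_deposed):
    """Expected utility of a rule, given the probability of being deposed."""
    ud, und = u[rule]
    return p_deposed * ud + (1 - p_deposed) * und

# If charismatic: pr(D | S,C) = 0.4 and pr(D | L,C) = 0.2.
assert eu("severe", F(2, 5)) == 58
assert eu("lenient", F(1, 5)) == 64
# If uncharismatic: pr(D | S,~C) = 0.8 and pr(D | L,~C) = 0.55.
assert eu("severe", F(4, 5)) == 26
assert eu("lenient", F(11, 20)) == 36
# Lenience wins on either branch, so the causal application of STP
# recommends lenience.
```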
But now imagine that this inference is strenuously opposed by the king’s second advisor, Eminence grise no 2 (Gris Clair). Gris Clair argues that, on the contrary, the partitional possibilities relevant to the king’s decision problem are those which are stochastically independent of the acts considered. Namely, the possibilities that the king will be deposed, or that he will not be deposed. The stochastic independence is easy to see, since both pr(D ∣ S) and pr(D ∣ L) are 0.48. Note that on the assumption that he will be deposed, the king prefers reigning severely to reigning leniently. Similarly, on the assumption that his reign will be long lived, the king prefers severity to leniency. Thus, by a simple argument from dominance, it ought to be rational for him to choose to reign severely.

Do Gris Sombre and Gris Clair disagree with each other? They do. They disagree for the obvious reason that their opinions cannot be reconciled, since eventually the king must choose one act or the other, so there is no room for giving each his due. As with any case in which the conflict is genuine, one side of the argument must be vindicated, and the other must give way. Which? Gibbard and Harper suggest that the causal argument is the correct argument. I assess their argument below.

6.3 Responses

6.3.1 Response to the first challenge: validity

Consider again f and g before the procedures have been carried out, that is, before any draw has been made. If you don’t know anything about which patient is going to be selected, that is, whether they are from Chicago or not, then you should indeed prefer f to g.

So what does it mean to have knowledge that the patient is from Chicago? Before the draws have been made, this cannot mean that the draws are made conditional on the information that the patient is from Chicago. As I explained above, this yields a different set of options, fC and gC, which are different options from simple f and simple g, because they are defined conditionally.
After the draws are made, there are two different patients to consider, the one selected by f and the one selected by g. The possibilities are that both patients are from Chicago; both are not from Chicago; the f-patient is from Chicago and the g-patient is not; the f-patient is not from Chicago and the g-patient is. In the first two cases, you would indeed prefer simple g to simple f. Also, if the g-patient is not from Chicago, and the f-patient is, you would prefer to wager the money on the former recovering, rather than the latter. But if the f-patient is not from Chicago, whereas the g-patient is, then you should prefer f to g. So if we disambiguate the options after the draws have been made, and advert to possible states of knowledge of the geographical origin of each patient, we don’t have a generalized argument from dominance.

In short, the counter-example fails because one way of disambiguating the condition that we know whether or not the patient comes from Chicago yields different, conditional procedures, which means that the premises of the original argument against STP aren’t satisfied; and the other yields no argument from dominance, because in one relevant possible case, the rational agent ought to prefer f to g.

6.3.2 Response to the second challenge: partition sensitivity

Gibbard and Harper’s response

In the same paper in which Gibbard and Harper point out the partition sensitivity of STP, they also attempt to provide a response to the problem that multiple applications of STP can in principle yield dissonant verdicts on what the expected utility maximizing act actually is for one and the same decision problem.
Their response is simply to distinguish between a causal form of STP, and an epistemic form of STP, or between applications of STP with causal independence assumptions, and applications of STP with epistemic independence assumptions.

In situations in which the two different forms of STP yield conflicting prescriptions, Gibbard and Harper advise giving precedence to the prescription recommended by the causal form of STP. For example, as a matter of philosophical intuition, their recommendation for Reoboam is to act leniently. It is unclear whether they think that the epistemic form of STP is of any independent use. On their account, epistemic STP seems to be of use only when it agrees with causal STP. But the stronger worry for epistemic STP is that it seems to make misleading recommendations (such as in the Reoboam example), so it’s not clear whether they think that one should have an epistemic form of STP at all. In any case, as I will show next, their strategy doesn’t work, because it is not general enough. In particular, it does not deal with cases in which the different prescriptions arise not because of conflict between an epistemic perspective and a causal perspective on expected utility, but because of conflicts between two different causal perspectives, that is, perspectives which licence the same, causal form of STP.

Reoboam Redux, or, “M” is for Machiavelli

Gibbard and Harper’s strategy of defending STP by drawing a distinction between causal and epistemic applicability fails. The reason it fails is this. The most significant problem posed by SP scenarios for the Sure Thing Principle is a phenomenon which Gibbard and Harper fail to notice, which is that different causal applications of STP can still produce different recommendations.
That is, one can give two different causally independent partitions of the same logical space, such that applying STP to one partition recommends an act which fails to maximize expected utility relative to the other partition (and conversely).

Going back to the Reoboam data, recall that the act that STP recommends relative to the act-independent causal partition by charisma is lenience. But as I will now show, one can engineer yet another partition on the same sample space which produces, by a third, causal application of STP, the exact opposite prescription, namely that the utility maximizing act is in fact severity.

Returning to the story, consider the following elaboration. Suppose that the very same sample of kings was re-tested for a different ineffable causally independent property, the property of Machiavellianism. As with charisma, assume that Machiavellianism is ingrained, innate, and irremediable by act of will, a mere accident of biography, but of great consequence as portent of success in the sphere of politics (whereas in Gibbard and Harper’s charming fable, charisma was identified by dissecting the pineal gland, we may suppose that Machiavellianism resides in the testes of the dead kings, and it was only later, when the times grew less reverent of royalty, that the court anatomists dared the more advanced acts of desecration). The data emerging from the new experiment is given in Table 6.4.

Table 6.4: Reoboam redux.

                   Mac           ¬Mac         Deposed (%)
                 D    ¬D       D    ¬D     Mac   ¬Mac  Overall
  severe (S)    22    18       2     8      55     20       48
  lenient (L)    8     2      16    24      80     40       48

Observe that relative to the new data, the original SP paradoxical correlations are systematically inverted. Obviously, since the sample is the same sample of individuals, some features of the model don’t change. (In fact quite clearly they can’t change, on pain of incoherence.) For instance, the percentage of deposed kings as a function of severity or lenience doesn’t change (48 %).
Whether or not a king is deposed is (still and forever more) probabilistically independent of the type of rule enacted upon the populace.69 And so forth. But something important has changed relative to the new data, which is that lenience and severity act in exactly the opposite way probabilistically - given Machiavellianism and its absence - as they do given possession or lack of charisma. It is now leniency - and not severity - which is positively correlated with deposition, both given Machiavellianism, and given its absence, from the make-up of a king. Compare pr(D ∣ L ∧ M) = 0.8 with pr(D ∣ S ∧ M) = 0.55; and furthermore pr(D ∣ L ∧ ¬M) = 0.4 with pr(D ∣ S ∧ ¬M) = 0.2. In short, whereas relative to the partition by possession of charisma, severity was probabilistically positive to deposition, relative to the partition by possession of Machiavellianism, it is lenience which is probabilistically positive towards deposition. (One way an augur and an omen another.)

69 It is useful to check all the other invariances, just to make sure this is the same sample. E.g. the brute number of deposed kings in Table 6.3 equals the brute number of deposed kings in Table 6.4, etc.

How does this new causally independent partition change things for Reoboam and his advisors? By analogy with the causal argument from charisma, the possibilities are that the king possesses the attribute of Machiavellianism, or that he does not possess it. This changes the calculation of expected utilities in the following way.
I will not repeat the calculations in detail; following the same pattern as before, with the probabilities now conditioned on Machiavellianism (pr(D ∣ S ∧ M) = 0.55, pr(D ∣ L ∧ M) = 0.8, pr(D ∣ S ∧ ¬M) = 0.2, pr(D ∣ L ∧ ¬M) = 0.4), it can easily be checked that:

eu(S) if Machiavellian = 0.55 × 10 + 0.45 × 90 = 46
eu(L) if Machiavellian = 0.8 × 0 + 0.2 × 80 = 16
eu(S) if not Machiavellian = 0.2 × 10 + 0.8 × 90 = 74
eu(L) if not Machiavellian = 0.4 × 0 + 0.6 × 80 = 48

From these four basic calculations, it evidently follows that:

If Reoboam is Machiavellian, the expected utility of lenience is less than that of severity.

If Reoboam is not Machiavellian, the expected utility of lenience is less than that of severity.

By application of STP, on the basis that Reoboam must either be, or fail to be, Machiavellian, it follows that:

(EU-STP - causal 1) Expected utility of severity is greater than the expected utility of lenience.

which, to recall, contradicts the conclusion gained by application of STP to the charisma partition, that:

(EU-STP - causal 2) Expected utility of severity is less than the expected utility of lenience.

Taking stock

Gibbard and Harper show that there can be a causal partition and an epistemic partition on the same data, such that STP delivers different recommendations with respect to each. I have shown that there can be a causal partition and a second causal partition on the same data, such that STP delivers different recommendations with respect to each. From this it follows that the tactic of distinguishing among an epistemic and a causal reading of STP does not address the problem (though no doubt it is a valid distinction to make). The situation is this. Relative to the partition by epistemically independent outcome (king will be deposed, or he won’t), STP recommends severity. Relative to the partition by the first causally independent property (king possesses charisma, king does not), STP recommends leniency. Relative to the partition by the second causally independent property (king possesses Machiavellianism, king does not), STP recommends severity. And so forth.
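The duelling recommendations can be reproduced mechanically. The following is a sketch, not part of the original text; the conditional probabilities of deposition are read off Tables 6.3 and 6.4, the utilities off Table 6.2, and exact fractions avoid rounding:

```python
from fractions import Fraction as F

# Utilities from Table 6.2: (if deposed, if not deposed).
u = {"severe": (10, 90), "lenient": (0, 80)}

def eu(rule, p_deposed):
    """Expected utility of a rule given a probability of deposition."""
    ud, und = u[rule]
    return p_deposed * ud + (1 - p_deposed) * und

# Charisma partition (Table 6.3): lenience wins on both blocks ...
assert eu("lenient", F(1, 5)) > eu("severe", F(2, 5))    # charismatic: 64 > 58
assert eu("lenient", F(11, 20)) > eu("severe", F(4, 5))  # uncharismatic: 36 > 26
# ... Machiavellianism partition (Table 6.4): severity wins on both blocks.
assert eu("severe", F(11, 20)) > eu("lenient", F(4, 5))  # Machiavellian: 46 > 16
assert eu("severe", F(1, 5)) > eu("lenient", F(2, 5))    # not Machiavellian: 74 > 48
```

So two causal applications of STP, run on the very same population, issue opposite prescriptions.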
We may imagine a series of causal partitions, each partitioning the logical space by a different causal factor, with the property that the utility maximizing act relative to each Πi either agrees, or disagrees, with the utility maximizing act recommended by the previous member of the series Πj. And therefore, each causally recommended act either agrees or disagrees with the one stochastically independent recommended act, which never changes, simply because it is independent of causal structure. It makes no difference to stochastic independence how one parses the causal structure of the world, since whatever it is, there can only be so many deposed lenient kings, and so many deposed severe ones, and so forth.

A way around the problem

There is some degree of disarray in this picture of apparently competing series of causally apt STP recommendations, but I don’t think it invalidates STP reasoning.

Consider again the example of the king. There are two partitions to consider, the partition by charisma, and the partition by Machiavellianism. We may form the product of these two partitions, yielding the following four basic possibilities:

Πcharisma × Πmachiavel = The king is charismatic and Machiavellian (CM); The king is charismatic and not Machiavellian (C¬M); The king is uncharismatic and Machiavellian (¬CM); The king is uncharismatic and not Machiavellian (¬C¬M).

Now, depending on the details of the case, we may or may not be able to represent the combined information in Table 6.3 and Table 6.4 in a unique way.70 Let’s make some simplifying assumptions in order to build up a serviceable (precise) set of combined data.71 I will make the following four assumptions:

1. I will assume that every severe deposed charismatic king was also Machiavellian (i.e. no SDC¬M possibility).

2. I will assume that every severe non-deposed Machiavellian king was also charismatic (i.e. no S¬DM¬C possibility).

3. I will assume that every lenient deposed non-Machiavellian is also not charismatic
(i.e. no LD¬MC possibility).

4. I will assume that every lenient non-deposed uncharismatic is also non-Machiavellian (i.e. no L¬D¬CM possibility).

In short, the combined data might look as in Table 6.5.

70 Perhaps no one knows any longer which pineal gland goes with which set of testes (at one extreme), or perhaps every part of the anatomy of a king was recorded and tracked (at the other).

71 To stress, these assumptions aren’t necessary. But they are helpful. It’s good to work with a model in which the probabilities are defined, rather than just falling in certain intervals. Besides which, the story is up to us to define. We could just stipulate that the data in Table 6.5 is the data gathered by the anatomists, and that’s the end of it. But a stipulation might give the wrong impression, which is that combining information pertaining to a pair of causal partitions always yields a precisely defined unique information state. It needn’t, and that’s obviously a complication that a more evolved discussion of the Sure Thing Principle would have to take account of.

Table 6.5: Product data for charisma and Machiavellianism.

                  CM         C¬M        ¬CM        ¬C¬M
                D   ¬D     D   ¬D     D   ¬D     D    ¬D
  severe (S)   16   18     0    6     6    0     2     2
  lenient (L)   2    2     0    6     6    0    16    18

If that’s the data, and the new partition is {CM, C¬M, ¬CM, ¬C¬M}, what about expected utilities? These can be recalculated as follows. First, some probabilities:

pr(D ∣ SCM) = 16/34, pr(¬D ∣ SCM) = 18/34;
pr(D ∣ SC¬M) = 0, pr(¬D ∣ SC¬M) = 1;
pr(D ∣ S¬CM) = 1, pr(¬D ∣ S¬CM) = 0;
pr(D ∣ S¬C¬M) = 0.5, pr(¬D ∣ S¬C¬M) = 0.5;
pr(D ∣ LCM) = 0.5, pr(¬D ∣ LCM) = 0.5;
pr(D ∣ LC¬M) = 0, pr(¬D ∣ LC¬M) = 1;
pr(D ∣ L¬CM) = 1, pr(¬D ∣ L¬CM) = 0;
pr(D ∣ L¬C¬M) = 16/34, pr(¬D ∣ L¬C¬M) = 18/34.

This completes the specification of the model.
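Before walking through the cases by hand, the dominance claim can be checked mechanically. The following is a sketch, not part of the original text, using the counts of Table 6.5 and the utilities of Table 6.2 (the block labels are mine):

```python
from fractions import Fraction as F

# Table 6.5 counts as (deposed, not deposed) per rule and product-partition block.
counts = {
    ("severe", "CM"):   (16, 18),  ("lenient", "CM"):   (2, 2),
    ("severe", "C~M"):  (0, 6),    ("lenient", "C~M"):  (0, 6),
    ("severe", "~CM"):  (6, 0),    ("lenient", "~CM"):  (6, 0),
    ("severe", "~C~M"): (2, 2),    ("lenient", "~C~M"): (16, 18),
}
u = {"severe": (10, 90), "lenient": (0, 80)}   # Table 6.2: (deposed, not deposed)

def eu(rule, block):
    """Expected utility of a rule within one block of the product partition."""
    d, nd = counts[(rule, block)]
    p = F(d, d + nd)
    ud, und = u[rule]
    return p * ud + (1 - p) * und

# Severity dominates lenience on every block of the refined partition.
for block in ("CM", "C~M", "~CM", "~C~M"):
    assert eu("severe", block) > eu("lenient", block)
```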
It can be shown case by case that, relative to each block of the refined (richer) partition, severity dominates lenience.

If CM, eu(severity) = 16/34 × 10 + 18/34 × 90 = 52 (approximately); whereas eu(lenience) = 1/2 × 0 + 1/2 × 80 = 40.

If C¬M, eu(severity) = 0 × 10 + 1 × 90 = 90, whereas eu(lenience) = 0 × 0 + 1 × 80 = 80.

If ¬CM, eu(severity) = 1 × 10 + 0 × 90 = 10, whereas eu(lenience) = 1 × 0 + 0 × 80 = 0.

Finally, if ¬C¬M, eu(severity) = 1/2 × 10 + 1/2 × 90 = 50, whereas eu(lenience) = 16/34 × 0 + 18/34 × 80 = 42 (approximately).

Clearly, then, the application of STP to the combined (product) causal partition yields severity as the rational act for Reoboam to take. Now, the combined causal partition is informationally richer than either partition on its own. For that reason, one may choose to listen to the combined causal partition, rather than to the informationally poorer causal representations. The point that comes out from this is that the proper application of STP can’t just be to one causal partition or another. It has to be to the most complete causal partition we can get hold of, the one that encompasses the most causal background information. Depending on the individual problem, perhaps there is one, perhaps there isn’t.

In light of the phenomenon of partition sensitivity, is STP a valid form of reasoning? Certainly, we “cheated” a bit to generate our data just to suit the needs of our model (hence the symmetries in Table 6.5). The argument was tailor-made to press the easiest conclusions out of the data. But partition sensitivity as such does not throw doubt on the validity of STP as a form of reasoning. I remain persuaded that STP is a valid form of reasoning. We just need better background theories in order to get the most out of it.
A theory which includes STP as a norm of decision making or rational preference must impose a certain constraint of causal completeness, so that rational agents must gather the maximum amount of causal knowledge to estimate the combined (product) partition for all the relevant factors and partitions before they engage in STP reasoning. STP is subordinate to causal analysis. It is a moot point whether or not such a theory imposes an unreasonable level of idealization on the theoretical subject of decision theory. Conversely, any expectations that decision theory is an apt model for human decision-making must be tempered by the knowledge that our causal understanding is most likely incomplete.

6.4 Conclusion

In this chapter I have discussed two notorious problems raised by Simpson's paradox for Savage's Sure Thing Principle. I've examined an apparently weightier but dismissible charge of invalidity, and a seemingly lighter but ineliminable charge of partition-sensitivity. The charge of invalidity states that reasoning in accordance with STP goes wrong in SP scenarios, relative to the specific partition in which the correlations are reversed. The charge of lack of partition invariance does not entail invalidity, since it does not amount to a proof that we go wrong when we reason according to a given partition; rather, it points out that not every partition is as good as any other when it comes to getting a prescription out of STP. Gibbard and Harper proposed that when causal and epistemic partitions yield incommensurate recommendations, we should prefer the causal to the epistemic. They were completely right about that.
Their point does not however impugn the validity of the principle, but immediately its applicability, and more distantly its position at the core of Savage-style decision theory.

The troubling aspect of the argument put forward by Gibbard and Harper is that by performing legitimate derivations in accordance with STP relative to different partitions we obtain incompatible recommendations. Nothing goes wrong with the reasoning in these competing cases, which means that the threat of dissonance is all the more pressing; but in that case the choice of premises becomes paramount. Choosing sound premises for an application of STP or dominance reasoning will lead to sound recommendations. One has to take responsibility for how the logical space is divided, before one is able to apply STP in a satisfactory manner. It looks as though possession of a good degree of causal discernment and causal knowledge is a central part of decision theory. Simpson's paradox drives this lesson home in a spectacular manner.

Chapter 7

Unrestricted Restricted STP

7.1 Introduction

Aumann et al. [4] make several revisionary recommendations concerning the scope of the famous sure thing principle of decision theory due to Savage [72]. The most far-reaching of these is to restrict the application of the sure thing principle (STP) to disjoint possibilities. Aumann et al. [4] motivate the restriction by means of a putative counterexample, which leads the authors to conclude that when STP is applied to mutually compatible events, reasoning in accordance with STP has no inherent validity and is thus not guaranteed to lead to safe conclusions about which act is optimal or choiceworthy. Against Aumann et al. [4] I argue that the counterexample fails to work.
A close analysis of the structure of their argument shows that either (a) contrary to their analysis, the prescription derivable on the basis of STP is in perfect accord with intuition, or else (b) the basic premises for an application of the sure thing principle are not met, which means that the counterexample misfires.

The sure thing principle was put forward by Savage as an extralogical principle commanding a sufficient amount of intuitive justification to put it on an equal footing with axioms of a purely logical nature. In an earlier chapter, I discussed STP and the problem of partition sensitivity, as illustrated by cases of Simpson's paradox. I concluded that substantive causal knowledge is required for a correct application of STP to a given decision problem, and hence that STP cannot have a purely logical nature. Savage was entirely correct to feel that there was an important difference marking off STP from his other axioms. STP must be qualified by causal knowledge. A related problem is the need to restrict STP to causally independent partitions. The notorious deterrence problem discussed earlier in the thesis shows that reasoning in accordance with STP when it is in the agent's power to causally affect the distribution of probabilities across the partition can be spectacularly invalid. By making choices, the agent can causally influence how likely it is that states of the world obtain, and therefore how likely it is that he will come to learn that various states obtain. If knowledge of the states causally depends on his action, STP cannot be validly applied to premises in which the knowledge is invoked.

Thus both Simpson's paradox and the related problem of ensuring that the agent is not in a position to manipulate his states of knowledge (or the probabilities that they obtain) strongly suggest that STP should be causally restricted. However, the restriction proposed by Aumann et al. [4] is logical rather than causal in nature.
I argue that logical restrictions of STP are illegitimate. In short, I argue here that the causally restricted STP should be logically unrestricted: a case for "unrestricted restricted STP".

7.1.1 Introducing the counterexample

The sure thing principle in its classic form due to Savage [72] asserts that if an agent prefers an option f over another g conditional on knowledge that E, and also prefers f to g conditional on knowledge that E is false, then he ought to prefer f to g unconditionally. The nature of the principle is epistemic, relating actual and possible knowledge and decision for a rational agent. If the agent's ranking of preferences conforms to the STP, then the agent's decision may be sensitive to the existence of the issue whether E holds, without being sensitive to the specific knowledge whether E. This means that the agent's decision is free from having to depend on irrelevant knowledge, including whether or not the agent is in fact ignorant whether E.72 In an often-encountered guise, the informal version of the STP states that:

(Sure thing principle) If an agent prefers f to g both knowing E and knowing ¬E, then he prefers f to g knowing neither E nor ¬E.

Savage did not attempt to formalize the STP in his founding work, coming as it did a decade before the publication of Hintikka's Knowledge and Belief and the subsequent development of epistemic modal logic.73 Instead, the intuitive validity of such a principle was motivated by Savage by means of a celebrated illustration. The illustration focusses on a rational agent in the concluding stages of a process of deliberation, asking himself a pertinent question about what he would do if he knew the outcome of a future event, in an attempt to settle the question what he ought to do at present; the agent deliberates in anticipation but without specific foreknowledge of the event:

A businessman contemplates buying a certain piece of property. He considers the outcome of the next presidential election relevant.
So, to clarify the matter to himself, he asks whether he would buy if he knew that the Democratic candidate would win, and decides that he would. Similarly, he considers whether he would buy if he knew that the Republican candidate were going to win, and again finds that he would. Seeing that he would buy in either event, he decides that he should buy, even though he does not know which event obtains, or will obtain, as we would ordinarily say. It is all too seldom that a decision can be arrived at on the basis of this principle, but except possibly for the assumption of simple ordering, I know of no other extra-logical principle governing decisions that finds such ready acceptance.

72 See Samet [70] and Samet [71] for a careful formulation of the issue of the independence of irrelevant knowledge in relation to the STP.

73 Hintikka [30].

The counterexample provided by Aumann et al. [4] involves a modification of the original illustration by Savage. But whereas the original illustration provided by Savage was intended to provide intuitive motivation for reasoning in accordance with the STP, the rival counterexample is intended to exhibit the limitations of STP reasoning, and therefore to provide intuitive motivation for restricting the application of the principle to disjoint events (events with empty intersection).

In line with the original example, the counterexample provided by Aumann et al. [4] is centred on the same kind of rational agent (a businessman deliberating without specific foreknowledge of an event presumed to be of relevance to his decision-making). But there are two important departures from the original example. First, their story presupposes that there are three candidates in the race, a Republican, a Democrat, and an Independent. And second, it is assumed that the agent is committed to a specific buying policy which ties the decision to his own rational degrees of belief.
That is to say, on antecedent grounds, the agent has decided that he will buy the property if and only if his subjective probability for the event Independent wins is greater than 1/2. His current probabilities give the Independent a small lead over the others, but not enough of a lead to justify buying the property.

For convenience, let I, R, and D denote the events Independent wins, Republican wins, and Democrat wins, and let pr reflect the businessman's degrees of credence, with atomic probabilities pr(I) = 3/7 and pr(D) = 2/7 = pr(R). The decisions available to the agent are δ1 = buy and δ2 = do not buy. Formally, we may think of the agent's decision as a function from the agent's own conditional probabilities74 to the range of options available to him, which outputs δ1 if pr(I) > 1/2 and outputs δ2 if pr(I) ≤ 1/2. Since on his current probabilities pr(I) < 1/2, it follows that his current decision ought to be δ2. But now consider the sure thing event either the Democrat loses, or the Republican loses. According to Aumann et al., if the agent comes to know that the Democrat lost, his subjective probability for the event I would be his conditional probability for I given the event Democrat loses, which is 3/5. Similarly, if the agent comes to know that the Republican lost, his subjective probability for the event I would be his conditional probability for I given the event Republican loses, which is 3/5. In either case, his subjective probability for I would be above 1/2, so in either case his counterfactual decision would be δ1. Since δ2 = do not buy is not the same decision as δ1 = buy, it follows on this argument that the agent's actual decision is not the decision he knows he would make if he knew which of the party candidates will lose.

74 More precisely, from a class of conditional probability functions determined by the lower bound condition.
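For concreteness, the arithmetic behind this step of the counterexample can be checked directly. The following sketch (an illustration of the naive conditioning move just described, not an endorsement of it) uses exact rationals:

```python
# Credences in the Aumann et al. story: the Independent leads, 3/7 to 2/7 each.
from fractions import Fraction as F

pr = {"I": F(3, 7), "D": F(2, 7), "R": F(2, 7)}

# Conditioning naively on the events "Democrat loses" (= I or R wins)
# and "Republican loses" (= I or D wins):
pr_I_if_D_loses = pr["I"] / (pr["I"] + pr["R"])  # 3/5
pr_I_if_R_loses = pr["I"] / (pr["I"] + pr["D"])  # 3/5

# Both conditional probabilities clear the buying threshold of 1/2,
# although the unconditional probability pr(I) = 3/7 does not.
print(pr["I"], pr_I_if_D_loses, pr_I_if_R_loses)
```

Whether these conditional probabilities are the right quantities to identify with the agent's probabilities under the knowledge supposition is precisely what is at issue in what follows.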
Moreover, if by an application of the sure thing principle we get the recommendation that δ1 = buy is the decision the agent ought to take, then there is an actual conflict between the recommendation of the sure thing principle and intuition. Intuition tells us that the correct decision for the agent is δ2 = do not buy. Our only option, according to Aumann et al., is to block the application of the sure thing principle in this context, which they propose we do on the grounds that applications of the sure thing principle to non-disjoint events are not naturally valid.

At the outset, the proposed counterexample raises several preliminary questions. First, note that one could take the argument to show that there is something wrong with the way that the decision function has been described, as opposed to reasoning from STP principles. Why doesn't the argument show that it is improper to define a decision function of this type in the context of Savage's framework, rather than taking it to show that the framework is susceptible to counterexample? Moreover, it is clear enough that in general we ought to draw a distinction between counterfactual probabilities, as probabilities under a supposition, and conditional probabilities. These are not the same thing. Why must we take it that the agent's probabilities if he knew which of the two party candidates lost are identical to his conditional probabilities on the event? And third, note that the appeal to intuition cuts both ways. It is true that intuition dictates that the correct decision for the agent is not to buy. But it is also true that intuition dictates that the sure thing principle ought to be valid unrestrictedly. The proposed restriction to safeguard intuitiveness is itself counterintuitive.

On the other hand, if the counterexample is valid, then the repercussions are wider than the discussion in that paper indicates.
The authors seem to think that the argument applies only to reasoning about decision, but isn't there a similar problem concerning belief? For suppose that we introduce a loosely constrained concept of belief, on which to believe that p it is sufficient that the agent's degree of belief in p is greater than 1/2.75 Then we are able to say that currently the agent believes that the Independent will lose, yet knows that no matter which other candidate loses, he would believe that the Independent will win. But it doesn't seem that the agent ought to give up his current belief, even though he knows that if he knew better, he would believe the opposite to what he currently believes.76 This is akin to the alleged contradiction between intuitively correct decision and reasoning about decision on the basis of the sure thing fact that either party candidate must lose. Except that it applies to belief, albeit to loosely constrained belief, and not to stronger concepts of belief such as belief only of propositions with probability one.

75 Such a concept of belief is commonplace in both formal and traditional philosophy. Arguably, it is the conception of belief used by Hume in his essay on miracles. See Section X of David Hume's An Enquiry concerning Human Understanding (Hume [32]). In formal areas it has sometimes been called comparative belief, since pr(p) > 1/2 if and only if pr(p) > pr(¬p). The comparative label serves to indicate that what is required for belief in this sense is the ability to be rationally more confident that p is true than false. It is being able to differentiate the truth and falsehood of p that matters to the loosely constrained concept of belief. For a formal semantics see for example Herzig [28].

76 To develop these remarks involves at the very least dealing with the complexity of expressing the relation knows more than in a formally adequate way. The problem is as follows. To express the relation of better knowledge, one needs an adequate epistemic modal logic and semantics. The modal logic of the knowledge operator is usually taken to be S5. The S5 modal system validates the so-called negative introspection axiom, according to which if the agent doesn't know a proposition p, he knows that he doesn't know p. But it is obviously impossible to know that you don't know p when you do in fact know p. Thus an agent who doesn't know which situation obtains knows that he does not know which; and the agent who does know which fails to know that he does not know which. In effect, reasoning in accordance with S5 guarantees that there is no "better" knowledge which is not at the same time "worse" knowledge as well, albeit that the source of this phenomenon is higher-order. As Moses and Nachum [58], page 156, put it, "taking the union of states of knowledge in which the agent has differing knowledge does not result in a state of knowledge in which the agent is more ignorant; it simply does not result in a state of knowledge at all!". Indeed, if the ignorant were to ask himself what he would do if he had the same information as the more knowledgeable version of himself, he would be asking a contradictory question; for if he were to know what the latter knows, he would find himself believing contradictory propositions. See Tarbush [78] for a syntactical approach to the problem.

It is important to stress that ultimately Aumann et al. [4] think that the counterexample should be rejected. They reject it not because they think the counterexample fails to work, but because in their view it does work, and only too well. I agree that the counterexample should be rejected. But I think that it should be rejected because it does not work. I think a simple Bayesian analysis shows that the counterexample is just bad reasoning, not bad reasoning on the basis of the sure thing principle, but ordinary bad reasoning which fails to represent the puzzle correctly.
Nothing in the alleged counterexample even begins to touch the validity of the sure thing principle. As we shall see, when the problem is properly represented, there is no argument from the sure thing principle to the conclusion that the agent ought to buy.

Let's address the preliminary questions one by one. Note that Savage's formal framework distinguishes between states of affairs, actions, and outcomes. An action is an abstract function from a state of affairs to an outcome or consequent state, which, as Joyce [36] observes, is perfectly intuitive, since actions do indeed set up a functional correspondence between a prior state and an ulterior consequence.77 But the idea of a decision-making policy relating a future plan for action to a sufficient degree of confidence that a specific event obtains doesn't obviously fit into the framework. Why not? Essentially because the formulation of the sure thing principle resorts to epistemic terms, whereas the argument given by Aumann et al. points us towards a different conception in terms of having or investing a sufficient degree of confidence in an event. Savage asked the question: what would the agent do if he knew that an event obtained? Aumann et al. practically ask the question: what would the agent do if his degree of rational confidence were above a certain threshold value? But there's no a priori reason to think that a formulation which works for epistemic modalities obligatorily has to work for weaker, non-epistemic modalities. It's possible until proof to the contrary that where the counterexample goes wrong is by importing an illicit extracurricular assumption into an otherwise perfectly sound framework.

77 p. 50, Joyce [36]. As he remarks, "the woman who chooses to walk to work on a cloudy day really does make her happiness a function of the weather".

In general, if an argument relies on a special assumption, it is a good idea to query the special assumption first, before calling into question the formal reasoning which proceeds from it.
Now for the second question. Are we entitled to equate the agent's probabilities under the supposition that he knows which party's candidate lost with his probabilities conditional on this information? Intuitively, it seems that we should identify the agent's probabilities under the knowledge supposition with probabilities conditional on the agent's total information, not with probabilities conditional on the event that the Democrat (or else the Republican) lost. For suppose that in actual fact the agent goes on to learn a logically stronger proposition, which entails that the Democrat lost. Then if we condition on just the latter information, we are ignoring relevant information in the agent's possession. But that means that the probabilities we obtain are not the agent's own probabilities, since these are inclusive of everything that the agent knows. On the other hand, if we identify the agent's probabilities with probabilities conditional on the total evidence available to the agent (as it seems we must), the argument clearly does not go through. For suppose that the agent knows that the Democrat lost in virtue of knowing that he lost to the Republican, which in turn entails that the Independent also lost the race. Then his probability for the Independent winning is zero, assuming he trusts this information. Similarly, suppose that he knows that the Republican lost by derivation from the known fact that the Democrat won, which entails that the Independent assuredly lost as well. Then his probability for the Independent winning would again be zero. The first scenario demonstrates that it is wrong to assume that if the agent knew that the Democrat has lost, his probability for the Independent winning would be above 1/2. The second, that it is wrong to assume that if he knew that the Republican lost, his probability for the Independent winning would be above 1/2.
At most, we can agree that if the agent knew which party candidate lost, his relevant probabilities might be above 1/2, not that they would be above 1/2. But the weaker premises clearly fall short of what is required to make up an argument from the sure thing principle.

How can we ensure we get the stronger premises needed for the counterexample to work as smoothly as possible? We could require, as Aumann et al. [4] themselves suggest, that the knowledge hypothetically attributed to the agent is just the knowledge of which party candidate lost, and not logically stronger knowledge of which party candidate won. Then the argument that the agent's probabilities for the Independent winning might well be zero, depending on what else he learns in addition to learning the identity of the losing party candidate, is blocked.78 But that means that our formal model by means of which we are to make sense of the counterexample requires us to distinguish more structure than just what the atomic propositional variables (and Boolean combinations thereof) allow. We need a framework which is apt as far as the key restriction on what is knowable by the agent is concerned. I will return to this crucial point below.

Concerning the third matter arising for this argument, I note that there's no obvious reason why reasoning in accordance with the sure thing principle should be restricted to possibilities which are mutually incompatible. There doesn't seem to be anything pertaining to the nature of the sure thing principle that would warrant the restriction. Why should it depend on the possibilities in question having an empty intersection? Where would such a restriction come from? Suppose that it is so late in the evening that only two buses may still come tonight.
If one came, it would be of no use to me, and if the other came, it would also be of no use to me, from which it follows that I ought to make alternative arrangements to get home tonight, without hanging about in the rain to see which one of them turns up. Why should the validity of this argument be impugned by noticing that it's perfectly possible that both buses may come at once? Savage himself thought of the sure thing principle as an "extra-logical" assumption, not quite so pure as to be defended on the basis of logic alone. But this doesn't mean that he thought it could be restricted in a major way without a serious loss of reasoning power to his framework. Fortunately, as I will now argue, there is no need to countenance restricting the principle, as it is under no threat from the proposed type of counterexample.

78 It seems odd that Aumann et al. don't treat the no stronger knowledge requirement on a par with the logical observation that the event space isn't partitional, since "Republican lost" and "Democrat lost" are mutually compatible events. For all that is apparent, they might as well have argued that the application of the sure thing principle should be restricted to cases in which there is no need to limit what the agent can know, i.e. that unrestricted knowledge or access to information is a precondition of the correct application of sure thing reasoning.

7.2 Modelling the knowledge restriction

Let us start with the basics. Let w1, w2, and w3 refer to a world in which the Democrat, the Independent, and the Republican wins, respectively. We may take it that the collection W = {w1, w2, w3} represents the epistemically possible worlds for the agent at the beginning of the story. The probability function pr that the agent starts out with can be construed as a function over the algebra determined by W, with atomic probabilities pr(w1) = pr(w3) = 2/7 and pr(w2) = 3/7. Note that pr({w1,w2}) = pr({w2,w3}) = 5/7.
Moreover, if we condition the event that the Independent wins on either of these two events, we obtain:

pr({w2} ∣ {w1,w2}) = pr({w2} ∣ {w2,w3}) = 3/5

This immediately suggests a way to represent the central claim that if the agent knew which party candidate had lost, his probabilities for the Independent winning would be 3/5. Ideally, we would be able to represent what the agent can know in each state of the world as a function from the state of the world to one of the two subsets {w1,w2} or {w2,w3}, which intuitively speaking correspond to "ulterior" states in which the agent is able to eliminate just the world in which the corresponding party candidate wins, and no more. This would be an elegant and efficient way of capturing the no stronger knowledge condition. The functional conception makes perfect sense in w1 or w3; we may imagine that a world in which one party candidate loses to another party candidate is a world in which the agent finds out the identity of the losing party candidate. But the "function" breaks down over the middle world w2, since we could map a world in which the Independent wins just as well to {w1,w2} as to {w2,w3}. Without prejudice to either mapping, essentially here we face a choice between two functions f1 ∶ W → ℘W and f2 ∶ W → ℘W such that:

f1(w2) = {w1,w2}
f2(w2) = {w2,w3}

i.e. between two different functions which agree on what the agent may know in w1 and w3 but disagree over what he may know in w2.

Note that the problem isn't merely formal. It is also intuitive. For imagine that we're in a world in which the Independent goes on to win the election. Then even if we assume that the agent goes on to learn just that the Democrat loses, or just that the Republican loses, there is a degree of uncertainty as to which of these facts the agent will go on to learn in good time. This uncertainty is just what the formal problem is all about.
Our functions f1 and f2 are equally good ways to represent restricted knowledge, but opting for either over the other is invidious, since nothing in the nature of the problem tells us that f1 is better than f2, or vice versa.

One obvious way to get around this problem is by applying the usual Cartesian expansion on the original logical space. Let us expand the model by introducing a new model-theoretic entity which is an ordered pair of world and what the agent may come to know, or more simply function. If we form the Cartesian product of W with F = {f1, f2}, we obtain a new universe of the model, containing:

W11 = (w1, f1)
W12 = (w1, f2)
W21 = (w2, f1)
W22 = (w2, f2)
W31 = (w3, f1)
W32 = (w3, f2)

Let W = {W11, W12, W21, W22, W31, W32}. Note that the indexes are meant to provide a mnemonic to show both who wins in actual fact, and what the agent knows, in each new world. For example, W11 is a world in which in actual fact the Democrat wins, and the knowledge function is such that the agent finds out that the Republican lost if the Democrat wins OR if the Independent wins, and finds out that the Democrat lost only if the Republican wins; etc.

Next, we face the task of reinterpreting the old probability function pr defined over ℘W into a unique probability function Pr defined over ℘W. Unfortunately, the information we have about pr is insufficient for the determination of a unique probability function. That is because the sole constraints that pr contributes to the determination of Pr are:

Pr({W11, W12}) = 2/7
Pr({W21, W22}) = 3/7
Pr({W31, W32}) = 2/7

So let's impose an additional constraint that will determine a unique probability function together with the above.
Let's agree that:

When the Independent wins, it is equiprobable that the agent comes to know that the Democrat loses, or comes to know that the Republican loses.79

79 We don't have to impose this constraint; in fact, below I discuss the more general case in which it is liberalized. The counterexample still fails on the liberalized construal, as I will subsequently show.

Then we get a fully defined function, with atomic probabilities as follows:

Pr({W11}) = 2/14
Pr({W12}) = 2/14
Pr({W21}) = 3/14
Pr({W22}) = 3/14
Pr({W31}) = 2/14
Pr({W32}) = 2/14.

Now take the key claim that if the agent learns that the Democrat loses, his probability that the Independent wins would be 3/5. Is this true? No, it is not. The probability that the Independent wins is unchanged. How so? It is easy to verify that the event that the agent comes to know that the Democrat loses is the set {W22, W31, W32}, since these are the worlds in which the agent actually finds out that the Democrat loses. Calculating the probability of this event gives us Pr({W22, W31, W32}) = 3/14 + 2/14 + 2/14 = 1/2. Next, note that the event Independent wins is the set {W21, W22}, and moreover its probability is Pr({W21, W22}) = 3/7. Intersecting it with the former knowledge event gives us the set {W22}. This is the only world in which the Independent wins and the agent comes to find out that the Democrat lost (and no more, by permission of our model). Thus we obtain that the probability that the Independent wins, given that the agent learns that the Democrat loses and nothing else, is:

Pr({W21,W22} ∣ {W22,W31,W32}) = Pr({W22})/Pr({W22,W31,W32}) = (3/14)/(1/2) = 3/7.

So the unconditional probability of the event that the Independent wins, and its probability conditional on sole knowledge that the Democrat lost, are equal.
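This verdict can be double-checked mechanically. The sketch below is an illustrative encoding of the six-world construction (the dictionary Pr and the helpers learns and cond are my own labels, not part of the thesis's formalism): each world is a (winner, knowledge function) pair, with the equiprobability constraint built into the atomic probabilities.

```python
from fractions import Fraction as F

Pr = {  # atomic probabilities over the expanded universe of six worlds
    ("D", "f1"): F(2, 14), ("D", "f2"): F(2, 14),
    ("I", "f1"): F(3, 14), ("I", "f2"): F(3, 14),
    ("R", "f1"): F(2, 14), ("R", "f2"): F(2, 14),
}

def learns(winner, f):
    """What is revealed: just which party candidate lost, and no more."""
    if winner == "D":
        return "~R"  # the Democrat won, so the Republican lost
    if winner == "R":
        return "~D"
    # The Independent won: which fact is revealed depends on the function.
    return "~R" if f == "f1" else "~D"

def cond(hyp, ev):
    """Pr(hyp | ev), with events given as predicates on worlds."""
    p_ev = sum(p for w, p in Pr.items() if ev(w))
    p_both = sum(p for w, p in Pr.items() if ev(w) and hyp(w))
    return p_both / p_ev

indep_wins = lambda w: w[0] == "I"
learns_notD = lambda w: learns(*w) == "~D"

# The knowledge event {W22, W31, W32} has probability 1/2, and conditioning
# on it leaves the probability that the Independent wins at 3/7, not 3/5.
print(cond(indep_wins, learns_notD))
```

The announcement is probabilistically irrelevant to I, exactly as the calculation in the text shows.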
A similar calculation gives us the same result for conditioning on the knowledge that the Republican lost.

Thus the proper representation of the problem demonstrates that information about which party candidate lost is irrelevant to the agent. Far from its being the case that the agent's probability would be above 1/2, we find that the agent's probability would be unchanged if he knew just which party candidate lost.

One might object that the probabilistic irrelevance conclusion is an artefact of the assumption that in a world in which the Independent wins, the agent is equally likely to find out that the Democrat lost (and no more), or to find out that the Republican lost (and no more). In a moment I will liberalize this condition, but before that, I want to note that the argument I gave above is a Bayesian argument in disguise. So let's give the argument again, making sure that its Bayesian nature is manifest.

To do this we will reuse our old sentential variables I, D, and R, and introduce several new formulae prefixed by a new syntactical operator !, namely !¬D and !¬R. The exclamation mark indicates that it is revealed to the agent that a certain fact obtains. For example, !¬D means that it is truthfully revealed to the agent in due course that the Democrat lost.80 The operative notion is the notion that a certain fact is truthfully revealed to the agent conditional on certain states of the world. We will assume that after a public revelation the agent knows the content of what is revealed to him. Let us define a new probability function µ in line with our example as follows: µ(D) = µ(R) = 2/7 and µ(I) = 3/7. Moreover, let µ(!¬D ∣ D) = µ(!¬R ∣ R) = 0, since the revelation must be truthful, and stipulate that µ(!¬D ∣ I) = µ(!¬R ∣ I) = 1/2, corresponding to the equiprobability condition, here interpreted as saying that if the Independent wins, it is as likely to be announced that the Democrat lost as it is that the Republican lost.
Now let’s consider the agent’s prior and posterior probabilities. Hisprior probability for the Independent winning is µ(I) = 3/7. His expectationat the earlier time that it will be announced that the Democrat loses isµ(!¬D) = µ(!¬D ∣ I)×µ(I)+µ(!¬D ∣ R)×µ(R) = 1/2×3/7+1×2/7 = 1/2. Bythe Bayesian formula that pr(H ∣ E) = pr(E ∣H)× pr(H)/pr(E), we obtainthat the probability that the Independent wins given that it is announcedthat the Democrat loses is µ(I ∣!¬D) = µ(!¬D ∣ I) × µ(I)/µ(!¬D) = 3/7.The agent’s probability is unchanged. A similar calculation (which I80See Van Benthem [79] chapter 15 for an introduction to the language and logic ofpublic announcements.159will omit) works for !¬R. Thus a simple Bayesian analysis shown that nomatter what we suppose the agent learns, the information about the losingparty candidate is irrelevant as far as the chances of the Independent areconcerned.What if we don’t assume that if the Independent wins, it is as likely tofind out that the Democrat lost as it is to find out that the Republican lost?Does this broader move secure the counterexample? I will now show thatif we liberalize this assumption, then no matter how we liberalize it oneor the other premise is false. Either when the agent learns that theDemocrat loses, or when he learns that the Republican loses, his probabilityfor the Independent losing must be below 1/2. If one of these statementsmust be false, then there is no argument from the sure thing principle tothe counterintuitive conclusion that the agent must buy now, in the presenttense. In other words there is no counterexample to the sure thing principle.Not, one might add, even the ghost of one.Suppose that we replace the equiprobability assumption with the con-dition that µ(!¬D ∣ I) + µ(!¬R ∣ I) = 1. Whenever the Independent wins,there’s no option not to reveal the identity of one of the losing candidates.Let µ(!¬D ∣ I) = a, entailing that µ(!¬R ∣ I) = 1 − a, and moreover supposethat a < 1 − a. 
Suppose that the agent's evidence is !¬D. By the Bayes formula, µ(I ∣ !¬D) = µ(!¬D ∣ I) × µ(I)/µ(!¬D) = [a × 3/7]/[2/7 + 3/7 × a]. But since a < 1 − a, that means that µ(I ∣ !¬D) < µ(I) = 3/7. Now suppose that 1 − a < a, and let the agent's evidence be !¬R. By another application of the Bayesian formula we obtain that µ(I ∣ !¬R) = µ(!¬R ∣ I) × µ(I)/µ(!¬R) = [(1 − a) × 3/7]/[2/7 + 3/7 × (1 − a)]. But since 1 − a < a, it follows that µ(I ∣ !¬R) < µ(I) = 3/7. Putting this together, we can see that no matter what value we let a take, there will be a learning scenario in which the agent learns just the identity of the losing party candidate, and yet his posterior probabilities are even lower than they were to begin with. No counterexample to the sure thing principle is generated, because the sure thing principle does not apply. On the liberalized account, it is not a sure thing that the agent's probabilities will be above 1/2.

To sum up the argument, I have argued that Bayesian analysis shows the following. No matter how we parse the problem, either it is a sure thing that the information about which party candidate loses is irrelevant to the agent, or it is not a sure thing that the agent's probabilities pass the threshold value. If the information is irrelevant, probabilities stay where they are, and the sure thing principle correctly predicts that the agent should not buy. If the sure thing premise is not satisfied, we may conclude not that the sure thing reasoning has been invalidated, but that the attempt to apply the sure thing principle has misfired.

7.3 Conclusion

There is a family resemblance between the epistemic arguments based on Simpson's paradox examined earlier in the thesis, and the argument against STP in Aumann et al. [4]. Recall that deviant belief as defined by Morton is rational but systematically defeasible by better knowledge. The notion of belief compatible with the analysis in Aumann et al.
[4] is deviant in a deeper sense: it is rational belief which in future would have to be disbelieved no matter what the agent were to go on to learn. I have argued that their analysis is wrong, and hence that there is no reason to suppose that rational belief can undergo such radical shifts in light of expected knowledge. Collecting both conclusions in one, my view is that no case has yet been successfully made for expanding epistemology to include "rational deviance" either of justification or of belief.

Chapter 8

Conclusions

Simpson's paradox started out as a curious phenomenon in statistics. It filtered down into philosophy gradually, where it was first raised to general consciousness through the efforts of Cartwright, Skyrms, Eells, and others to pin down its implications for the theory of probabilistic causality. Pearl's magisterial opus on causation did the most to demystify causal notions and characterize the autonomous logic of causal concepts and causal reasoning. To date, Pearl's causal account of Simpson's paradox remains the most convincing and systematic characterization of the phenomenon.

I have argued that Simpson's paradox poses considerable difficulties for the probabilistic analysis of evidence. That there was a difficulty about evidence in contexts with Simpson's paradoxical probabilities was first spotted by Morton. His argument about deviant belief traces a familiar difficulty for the probabilistic theory of incremental confirmation, going back all the way to Carnap, which is that sometimes the consistency of certain sets of probabilistic statements implies a corresponding set of evidential judgments which seem counterintuitive. According to Morton, the challenge for epistemology is to come to terms with the idea that an item of evidence may confirm a hypothesis on one's present evidence, yet fail to confirm it relative to future evidence, no matter what one learns about it.
Simpson's paradoxical evidence implies that we need to acknowledge the possibility that a belief may be both justified and deviant with respect to future (or to counterfactual) courses of additional evidence.

However, I have argued that Morton misidentifies the problem of Simpson's paradoxical evidence, which involves causal knowledge and causal reasoning. I have argued that Morton's problem is relatively easy to solve, once we distinguish between causal and news-value senses of evidence. According to my proposal, evidence in a causal sense traces causal probabilities, whilst evidence in a news-value sense traces overall relevance. I have argued that appealing to this overlooked distinction helps make sense of the two different aspects in the force of evidence in Simpson's paradox. Moreover, I have argued that a fuller epistemology of rational belief ought to draw a distinction between causal degree of belief in Y given X, as the degree to which a rational agent ought to be confident that Y may be brought about as a consequence of bringing about X; and news-value degree of belief, as the degree to which information about X provides news or information about Y. It remains to be seen in future work how well this distinction can be implemented within a broadly Bayesian representation of rational belief.

One of the main lessons emerging from my discussion of the evidential implications of Simpson's paradox concerns the role played by causal knowledge in formal epistemology. In my view, there is a deep and largely unacknowledged problem in Bayesian epistemology, which requires rational degrees of belief to be represented as subjective probabilities, or degrees of confidence that somehow manage to reflect the total evidence, which is everything that the agent knows, in his credal state. But a probability function (or set thereof) cannot represent everything that the agent knows, if the agent possesses causal knowledge.
As has frequently been pointed out in the formal causal literature, causal structure is underdetermined by probabilistic information. Causal knowledge cannot be encoded in terms of conditional probabilities, but requires probabilities under an intervention. Yet it seems that the agent's full epistemic state ought to include causal knowledge (represented, for example, as a set of possible causal models). It seems surprising that a lesson so widely known in causal analysis has received so little attention in the formal epistemology literature. A systematic account and investigation of this problem is a topic for future research.

The second important application of Simpson's paradox is to decision theory. Here too the general discussion does not appear to pay much heed to Simpson's paradox, but there are exceptions to this trend. In some places the probabilistic identity between one form of Simpson's paradox and Newcomb's paradox has been noted. In several others, there have been attempts to apply graph-based causal analysis to Newcomb's paradox, yielding formal structural correspondences between Newcomb's paradox and Simpson's paradox. The point that emerges is that Simpson's paradox and Newcomb's paradox are both causally and probabilistically equivalent, which implies that resolving the former ought to resolve the latter. Yet little consensus has emerged about the implications of a systematic causal analysis for decision theory. This too is a further topic for investigation.

In this thesis I have looked at one specific and powerful consequence of the possibility of iterated reversals of probabilistic association (a multistage Simpson's paradoxical machine) for decision theory in the form due to Savage, to which both the evidential and the causal forms of decision theory may be traced. Here too the role played by causal knowledge has been underestimated.
It has been known since the early days, when Savage first proposed his theory of rational decision making, that the application of the theory was partition-sensitive. Gibbard and Harper contributed the crucial observation that Savage's proposed solution in terms of independence of the act partition must be understood as a requirement of causal independence, not merely epistemic independence. But Gibbard and Harper failed to notice that the problem of partition sensitivity affects causally independent partitions as well. Extrapolating their own example, I showed that one can produce a series of causally independent state partitions, each essentially exhibiting Simpson's reversals, which clash with each other in terms of which act they recommend. I suggested that the solution is to appeal to causal knowledge itself, as the determinant of the "unique" best partition (the maximally fine causal partition), relative to which the theory must be applied. A full discussion of the implications of causal knowledge for decision theory is a topic for further research.

Bibliography

[1] P. Achinstein. The Concept of Evidence. Oxford University Press, 1983.
[2] O. A. Arah. The role of causal reasoning in understanding Simpson's paradox, Lord's paradox, and the suppression effect: covariate selection in the analysis of observational studies. Emerging Themes in Epidemiology, 5:5, Jan. 2008.
[3] F. Arntzenius. No regrets, or: Edith Piaf revamps decision theory. Erkenntnis, 68(2):277–297, 2008.
[4] R. J. Aumann, S. Hart, and M. Perry. Conditioning and the sure-thing principle. Discussion Paper Series, Centre for the Study of Rationality, 2005.
[5] P. S. Bandyopadhyay, D. Nelson, M. Greenwood, G. Brittan, and J. Berwald. The logic of Simpson's paradox. Synthese, 181(2):185–208, Sept. 2011.
[6] S. Benferhat. Interventions and belief change in possibilistic graphical models.
Artificial Intelligence, 174(2):177–189, Feb. 2010.
[7] T. Blanchard and J. Schaffer. Cause without Default. In J. Beebee, C. Hitchcock, and H. Price, editors, Making a Difference, pages 1–29. Oxford, 2013.
[8] C. R. Blyth. On Simpson's Paradox and the Sure-Thing Principle. Journal of the American Statistical Association, 67:364–366, 1972.
[9] R. Carnap. Logical Foundations of Probability. University of Chicago Press, Chicago, 1952.
[10] N. Cartwright. Causal Laws and Effective Strategies. Nous, 13(4):419, 1979.
[11] M. Cohen and E. Nagel. An Introduction to Logic and the Scientific Method. New York: Harcourt, Brace and Company, 1934.
[12] E. Eells. Rational Decision and Causality. Cambridge University Press, 1982.
[13] E. Eells. Cartwright and Otte on Simpson's Paradox. Philosophy of Science, 54:233, 1987.
[14] E. Eells. Probabilistic Causality. Pacific Philosophical Quarterly, 61:50–74, 1991. ISSN 02790750. doi: 10.1017/CBO9780511570667. URL http://books.google.com/books?id=SvP_JWUcNgMC.
[15] E. Eells and B. Fitelson. Symmetries and Asymmetries in Evidential Support. Philosophical Studies: An International Journal for Philosophy in the Analytic Tradition, 107(2):129, 2002. ISSN 0031-8116.
[16] E. Eells and E. Sober. Common Causes and Decision Theory. Philosophy of Science, 53(2):223, 1986.
[17] B. Fitelson. Contrastive Bayesianism. In M. Blaauw, editor, Contrastivism in Philosophy, pages 64–88. Routledge, Abingdon, UK, 2013.
[18] D. Galles and J. Pearl. Axioms of causal relevance. Artificial Intelligence, 97(97):9–43, 1997.
[19] D. Galles and J. Pearl. An Axiomatic Characterization of Causal Counterfactuals. Foundations of Science, 3:151–182, 1998.
[20] P. Gärdenfors. Belief Revisions and the Ramsey Test for Conditionals. Philosophical Review, 95:81, 1986.
[21] P. Gärdenfors.
Knowledge in Flux: Modeling the Dynamics of Epistemic States. MIT Press, 1988.
[22] A. Gibbard and W. L. Harper. Counterfactuals and Two Kinds of Expected Utility. In W. Harper, R. Stalnaker, and G. Pearce, editors, Foundations and Applications of Decision Theory (I): Theoretical Foundations, volume 1, pages 125–162. Reidel, Dordrecht, 1978.
[23] I. J. Good. On the principle of total evidence. British Journal for the Philosophy of Science, 17(4):319–321, 1967.
[24] I. J. Good and Y. Mittal. The Amalgamation and Geometry of Two-by-Two Contingency Tables. The Annals of Statistics, 15(2):694–711, 1987.
[25] S. Greenland. Absence of confounding does not correspond to collapsibility of the rate ratio or rate difference. Epidemiology, 7(5):498–501, 1996.
[26] V. G. Hardcastle. Partitions, probabilistic causal laws, and Simpson's paradox. Synthese, 86:209–228, 1991.
[27] M. A. Hernán, D. Clayton, and N. Keiding. Simpson's paradox unraveled. International Journal of Epidemiology, 40(3):780–785, June 2011.
[28] A. Herzig. Modal probability, belief, and actions. Fundamenta Informaticae, 2711:62–73, 2003.
[29] G. Hesslow. Two Notes on the Probabilistic Approach to Causality. Philosophy of Science, 43(43):290, 1976.
[30] J. Hintikka. Knowledge and Belief. Cornell University Press, NY, 1992.
[31] C. Hitchcock. A Tale of Two Effects. Philosophical Review, 110(110):361–396, 2001.
[32] D. Hume. An Inquiry Concerning Human Understanding. Macmillan, 2008.
[33] R. Jeffrey. The Logic of Decision, 2nd ed. Synthese, 48(3):473–492, 1983.
[34] R. Jeffrey. The Logic of Decision. University of Chicago Press, 2nd edition, 1983.
[35] R. C. Jeffrey. Causality and the logic of decision. Philosophical Topics, 21:139–152, 1993.
[36] J. M. Joyce.
The Foundations of Causal Decision Theory (Cambridge Studies in Probability, Induction and Decision Theory). Cambridge University Press, 1st edition, 1999.
[37] J. M. Joyce. How Probabilities Reflect Evidence. Philosophical Perspectives, 19:153–178, 2005.
[38] J. M. Joyce. Are Newcomb problems really decisions? Synthese, 156(3):537–562, Apr. 2007.
[39] J. M. Joyce. Regret and instability in causal decision theory. Synthese, 187(1):123–145, Oct. 2012.
[40] H. Katsuno and O. Mendelzon. On the difference between updating a knowledge base and revising it. In Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference, pages 387–394. Morgan Kaufmann, 1991.
[41] T. Kelly. Evidence: Fundamental Concepts and the Phenomenal Conception. Philosophy Compass, 3(5):151–167, 2008.
[42] H. Kyburg. Comment. Philosophy of Science, pages 147–51, 1965.
[43] S. Lauritzen. Graphical Models. Clarendon Press, Oxford, 2010.
[44] D. Lewis. Causation. Journal of Philosophy, 70:556–567, 1973.
[45] D. Lewis. Counterfactuals. Blackwell, Oxford, 1973.
[46] D. Lewis. Prisoners' Dilemma Is a Newcomb Problem. Philosophy and Public Affairs, 14:235–240, 1978.
[47] D. Lewis. Causal decision theory. Australasian Journal of Philosophy, 59:5–30, 1981.
[48] D. Lindley. Seeing and doing: The concept of causation. International Statistical Review, 70(2):191–197, 2002.
[49] D. V. Lindley and M. R. Novick. The Role of Exchangeability in Inference. The Annals of Statistics, 9(1):45–58, 1981.
[50] G. Malinas. Simpson's paradox and the wayward researcher. Australasian Journal of Philosophy, 75(3):343–359, 1997.
[51] G. Malinas. Simpson's Paradox: A Logically Benign, Empirically Treacherous Hydra. The Monist, 84(2):265–283, 2001.
[52] G. Malinas and J. Bigelow. Simpson's paradox. In E. N. Zalta, editor, The Stanford Encyclopedia of Philosophy. The Metaphysics Research Lab, Stanford University, 2009 version.
[53] A. McLaughlin. Rationality and Total Evidence. Philosophy of Science, 37(2):271–278, 1970.
[54] C. Meek and C. Glymour. Conditioning and Intervening. The British Journal for the Philosophy of Science, 45(4):1001–1021, 1994.
[55] P. Menzies and H. Price. Causation As a Secondary Quality. The British Journal for the Philosophy of Science, 44(2):187–203, 1993.
[56] Y. Mittal. Homogeneity of Subpopulations and Simpson's Paradox. Journal of the American Statistical Association, 86(413):167–172, 1991.
[57] A. Morton. If You're So Smart Why Are You Ignorant? Epistemic Causal Paradoxes. Analysis, 62(2):110, 2002.
[58] Y. Moses and G. Nachum. Agreeing to Disagree After All (Extended Abstract). In R. Parikh, editor, Theoretical Aspects of Reasoning about Knowledge: Proceedings of the Third Conference, pages 151–168. Morgan Kaufmann, San Francisco, 1990.
[59] R. Nozick. Newcomb's Problem and Two Principles of Choice. In N. Rescher, editor, Essays in Honour of Carl G. Hempel, pages 107–133. Springer, 1969.
[60] R. Otte. Probabilistic Causality and Simpson's Paradox. Philosophy of Science, 52:110, 1985.
[61] R. Otte. Probabilistic Causality and Simpson's Paradox. Philosophy of Science, 52(1):110, 1985.
[62] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2000.
[63] J. Pearl. The Curse of Free-will and the Paradox of Inevitable Regret. UCLA Cognitive Systems Laboratory, (December):1–4, 2010.
[64] J. Pearl. Understanding Simpson's Paradox. The American Statistician, 68(1):1–12, Feb. 2014.
[65] K. Pearson.
Mathematical Contributions to the Theory of Evolution. VII. On the Correlation of Characters not Quantitatively Measurable, 1900. ISSN 1364-503X.
[66] W. Salmon. Statistical Explanation and Statistical Relevance. University of Pittsburgh Press, Pittsburgh, 1971.
[67] W. Salmon. Scientific Explanation and the Causal Structure of the World. Explanation, pages 78–112, 1984.
[68] W. C. Salmon. Confirmation and Relevance, volume 64. University of Minnesota Press, 1975.
[69] W. C. Salmon. Minnesota Studies in the Philosophy of Science, Volume XII: Scientific Explanations, volume 3. University of Pittsburgh Press, 1989.
[70] D. Samet. The sure-thing principle and independence of irrelevant knowledge. Knowledge Creation Diffusion Utilization, pages 1–16, 2008.
[71] D. Samet. Agreeing to disagree: The non-probabilistic case. Games and Economic Behavior, 69(1):169–174, May 2010.
[72] L. Savage. The Foundations of Statistics. New York: John Wiley and Sons, 1955.
[73] E. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society: Series B (Methodological), 13(2):238–241, 1951. ISSN 00359246.
[74] B. Skyrms. Causal Necessity. Yale University Press, 1980.
[75] B. Skyrms. Causal decision theory. Australasian Journal of Philosophy, 59(11):5–30, 1981.
[76] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search, Second Edition (Adaptive Computation and Machine Learning). A Bradford Book, second edition, 2001.
[77] P. Suppes. Probabilistic inference and the concept of total evidence. Technical Report no. 94, Institute for Mathematical Studies in the Social Sciences, Stanford University, 1966.
[78] B. Tarbush. Agreeing to disagree: a syntactic approach. Munich Personal RePEc Archive, 2011.
[79] J. Van Benthem. Modal Logic for Open Minds.
CSLI, Stanford, 2010.
[80] B. van Fraassen. The Scientific Image. Clarendon Press, Oxford, 1980.
[81] B. van Fraassen. Belief and the Will. The Journal of Philosophy, 81(5):235–256, 1984.
[82] C. G. Wagner. Simpson's paradox and the Fisher-Newcomb problem. Grazer Philosophische Studien, 40:185–194, 1991.
[83] C. H. Wagner. Simpson's Paradox in Real Life. The American Statistician, 36(1):46–48, 1982.
[84] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, New York, 2004.
[85] P. Weirich. Decision instability. Australasian Journal of Philosophy, 63(4):37–41, 2006.
[86] T. Williamson. Knowledge and its Limits. Oxford University Press, 2000.
[87] J. Woods. Paradox and Paraconsistency. Cambridge University Press, 2003.
[88] J. Woodward. Making Things Happen: A Theory of Causal Explanation. Oxford University Press, Oxford, 2003.
[89] G. U. Yule. On the Association of Attributes in Statistics: With Illustrations from the Material of the Childhood Society, &c, 1900.
[90] J. Zidek. Maximal Simpson-Disaggregations in 2x2 Tables. Biometrika, 71(1):187–190, 1984.

Appendix A

Conditioning and Intervening

What is the difference between conditioning and intervening? As a familiar part of probability theory, conditioning needs no introduction. Let Pr be a joint probability distribution defined over n discrete random variables X1, ..., Xn, where each variable Xi takes values in a domain dom(Xi) = {xi^1, ..., xi^k}.
If V is an arbitrary subset of X1, ..., Xn, we shall use the notation v⃗ to refer to any possible configuration of values for the variables in V. To condition Pr on the event that Xi = xi in order to find out the ensuing marginal conditional probability of Xj = xj, we apply the usual rules for Bayesian conditioning and summation over additional variables to obtain:

Pr(xj ∣ xi) = ∑_v⃗ Pr(v⃗, xj, xi) / ∑_{v⃗, xj} Pr(v⃗, xj, xi),   (A.1)

where V as required by (A.1) refers to the variables in the domain of Pr other than Xi and Xj. The formula above is a simple and familiar consequence of the usual way in which conditional probability is defined, namely:

Pr(x ∣ y) = Pr(x, y) / Pr(y),   (A.2)

together with the equally familiar formula for computing marginal probability, which in the simplest situation, involving just two variables, gives us:

Pr(x) = ∑_y Pr(x, y).   (A.3)

A distinctive feature of conditioning a probability function on the event Xi = xi is that it is justified by the assumption that Xi = xi is a certainty. In a dynamic sense it may be a certainty by way of having been learned or observed or acquired as evidence along the way. An intervention is often described in terms which make it sound quite similar to conditioning in this respect. The simplest type of intervention is an atomic intervention, in which a single variable like Xi is forced to take a fixed value xi. It is part of the definition, as well as the conceptual content, of the notion of an intervention that it has to be successful. If we intervene to force a value on a variable, it does indeed take that value with certainty. So in that way, intervening is similar to conditioning. The difference between conditioning and intervening lies in what can be inferred from the certainty that Xi takes value xi. When the certainty is achieved by forcing, the inferences are highly circumscribed and local. A good deal of the information contained in the pre-intervention probability function is preserved unchanged.
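Conditioning as in (A.1)–(A.3) is a purely arithmetical operation on the joint distribution. A minimal sketch follows; the three-variable joint table and its weights are an arbitrary illustration of mine, not an example from the text.

```python
from itertools import product

# A toy joint distribution Pr(X, Y, Z) over binary variables,
# stored as a table mapping value tuples to probabilities.
joint = {}
for x, y, z in product([0, 1], repeat=3):
    joint[(x, y, z)] = 1 + x + 2 * y + 3 * z   # arbitrary positive weights
total = sum(joint.values())
joint = {k: v / total for k, v in joint.items()}

def condition(joint, var, val):
    """Bayesian conditioning on the event {var = val}: keep only the
    agreeing points and renormalize by their total mass."""
    kept = {k: p for k, p in joint.items() if k[var] == val}
    norm = sum(kept.values())            # this is Pr(var = val)
    return {k: p / norm for k, p in kept.items()}

# Marginal conditional probability Pr(X = 1 | Z = 1), cf. (A.1):
post = condition(joint, var=2, val=1)
pr_x1 = sum(p for (x, y, z), p in post.items() if x == 1)
print(round(pr_x1, 4))   # -> 0.5455
```

Summing the conditioned table over the remaining variables is exactly the marginalization step ∑_v⃗ in (A.1).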
This point will come out best if we consider the formal representation of the notion of an intervention.

A.1 Decomposing a joint probability function

According to the product rule of probability theory, a joint probability function Pr defined over n distinct variables can be decomposed as a product of n individual conditional probability functions. As above, let the n variables be listed according to an arbitrary ordering X1, ..., Xn. The product rule chains together by multiplication the conditional probability of any variable Xj given its predecessors in the series, in the exact order in which the variables were enumerated:

Pr(x1, ..., xn) = ∏_j Pr(xj ∣ x1, ..., x_{j−1}).   (A.4)

The product rule is easily derivable from the definition of conditional probability. It has several notable features. First, it permits further simplification of the product formula, if further independence relations are known to obtain between specific sets of variables. Let Pa_j denote the predecessors of Xj in the ordering with the property that, conditional on known values of the variables in Pa_j, Xj is independent of the remaining set of predecessors. Then it may be possible to carry out a simplified instance of the product rule, obtaining:

Pr(x1, ..., xn) = ∏_j Pr(xj ∣ pa_j).   (A.5)

This is especially apposite for representations of causal effects in terms of a directed acyclic graph. Given a DAG G and a probability function Pr, if Pr is decomposable in the order imposed by construction of the graph, then Pr is said to be Markov compatible with G. This property is of great importance to causal analysis. The second important feature of the product factorization rule is that it is insensitive to the order in which the variables are selected. A joint probability function over n variables can be equivalently factorized in n! ways. Let σ be a permutation of X1, ..., Xn.
Then Pr is equivalently factorized by:

Pr(x1, ..., xn) = ∏_j Pr(σ(xj) ∣ σ(x1), ..., σ(x_{j−1})).   (A.6)

For example, suppose that Pr is defined over X, Y, Z and that we wish to factorize it both in this order, and according to the permutation σ giving σ(X) = Y, σ(Y) = X, and σ(Z) = Z. According to the first selection, we obtain:

Pr(x) Pr(y ∣ x) Pr(z ∣ x, y).   (A.7)

If we factorize according to the permutation σ, we obtain a probabilistically equivalent (but as we shall see, intervention-inequivalent) factorization:

Pr(y) Pr(x ∣ y) Pr(z ∣ x, y).   (A.8)

With the notion of a factorization of a probability function to hand, we may introduce the notion of an intervention. Given a factorized probability function, an intervention to set the value of a variable Xi to a fixed value xi does two things: (a) it knocks out the factor referring to the conditional probability of Xi given its predecessors, because there's no need to think of Xi as being dependent on its predecessors once its value is enforced; (b) it substitutes the new value xi into the remaining factorization wherever downstream references to the value of Xi occur, i.e. wherever Xi is itself conditioned upon. For example, suppose we intervene on the factorization in (A.7) to force Y to take value y∗. To do this we remove the term encoding the univariate conditional probability of Y, and instantiate the new value, to obtain:

Pr(x) Pr(z ∣ x, Y = y∗).   (A.9)

Notation: on this occasion I am using an explicit notation "Y = y∗" to make it clear that we're now instantiating or realizing the value of Y.

Now suppose we want to intervene to force Y to take value y∗, but in the second factorization we are considering, in (A.8). The factor expressing the conditional probability of Y is removed from the head of the factorization, and the value is instantiated, to obtain:

Pr(x ∣ Y = y∗) Pr(z ∣ x, Y = y∗).   (A.10)

Thus we have obtained two inequivalent post-intervention distributions, except for the special case in which X and Y happen to be probabilistically independent.
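The order-sensitivity just described can be exhibited numerically. The sketch below builds an arbitrary joint Pr(X, Y, Z) in which X and Y are dependent, applies the intervention setting Y = 1 to the two factorizations (A.7) and (A.8) as above, and compares the resulting marginal for X; only the X-marginal is compared, for brevity, since that is where (A.9) and (A.10) disagree.

```python
from itertools import product

# Joint Pr(X, Y, Z) over binary variables, with X and Y dependent.
# Weights are arbitrary; 4*x*y induces the X-Y dependence.
joint = {}
for x, y, z in product([0, 1], repeat=3):
    joint[(x, y, z)] = 1 + 4 * x * y + z
total = sum(joint.values())
joint = {k: v / total for k, v in joint.items()}

def marg(assignment):
    """Marginal probability of a partial assignment, e.g. {0: 1} for X = 1."""
    return sum(p for k, p in joint.items()
               if all(k[i] == v for i, v in assignment.items()))

def cond(target, given):
    """Conditional probability Pr(target | given); dicts map index -> value."""
    return marg({**target, **given}) / marg(given)

y_star = 1
# Under factorization (A.7), the factor Pr(y | x) is knocked out,
# so the marginal for X is the untouched Pr(x) -- cf. (A.9).
px_A9 = {x: marg({0: x}) for x in (0, 1)}
# Under factorization (A.8), the head factor Pr(y) is knocked out,
# but the factor Pr(x | Y = y*) survives -- cf. (A.10).
px_A10 = {x: cond({0: x}, {1: y_star}) for x in (0, 1)}
print(px_A9[1], px_A10[1])   # the two values differ, since X and Y are dependent
```

Making the X–Y weight coupling zero (so that X and Y are independent) collapses the two answers into one, matching the special case noted above.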
Although the factorizations we started from are probabilistically equivalent and cannot be distinguished by conditioning, they can be distinguished by intervening. An intervention is an operation which is sensitive to the order in which the recursive decomposition into conditional probability distributions is carried out.

A.1.1 Formal representations

The effect of an intervention can be analyzed formally in a variety of ways.

(1) As Bayesian conditionalization in an augmented directed acyclic graph to which a hypothetical link Fi → Xi is added, where Fi is a new variable representing the intervention explicitly as a variable in the graph, which takes values in the set {do(x′i), idle}. To account for the addition of the new node, a new conditional probability is stipulated for Xi, together with an arbitrary probability over Fi itself. The new conditional probability embodies the requirement that Xi is now subordinated to the new node, and that if we so wish, we can cancel out the competing influences of the parents of Xi by conditioning on Fi = do(x′i) (a kind of local "go ahead" event):

Pr(xi ∣ pai, Fi) =
  Pr(xi ∣ pai)   if Fi = idle,
  1              if Fi = do(x′i) and xi = x′i,
  0              if Fi = do(x′i) and xi ≠ x′i.   (A.11)

Note that if we set Fi = idle, then the graph reverts to its original probability description, and for all purposes it's as if the new node did not exist. The success condition (an intervention is not achieved unless it is successful) is reflected in the fact that if we set the variable to intervention mode, all points which fail to respect it insofar as the value for Xi is concerned are expunged. Moreover, Fi is fully independent of the parents of Xi, due to the fact that Fi and the parent variables meet in a collider formation (to recall, a collider is blocked, i.e.
is not a carrier of probabilistic influence, if neither it nor any descendent of it is conditioned upon).

(2) As a functional modification of a non-parametric structural equations model, in which the links in the directed acyclic graph are interpreted as encoding deterministic functions. Characterizing each child-parents relationship as a deterministic function yields equations of the form Xi = fi(pai, εi), where the εi represent mutually independent, randomly distributed disturbances or "error" terms. An intervention on Xi amounts to wiping out the equation for Xi, replacing it with a new equation Xi = xi, and substituting the new value in the remaining parts of the functional model.

(3) As a direct transformation between a pre-intervention and a post-intervention distribution which is Markov relative to a given DAG. In my view, this is the most semantically transparent construction of the notion of an intervention. One gets to see immediately how probabilities are moved about between the pre- and post-intervention distributions.

So again, suppose we want to intervene to force Xi to take some fixed value x′i. We can write down the condition for how the transformation is to be effected directly, namely:

Pr(x1, ..., xn ∣ do(x′i)) =
  Pr(x1, ..., xn) / Pr(x′i ∣ pai)   if xi = x′i,
  0                                 otherwise.   (A.12)

The stochastic interpretation in (A.12) is the most general. The transformation of probabilistic information characterized by (A.12) is entailed equally by the augmented model analysis as by the functional analysis, or by an approach combining the two. Moreover, condition (A.12) is extremely transparent from a semantical point of view. By looking at how the modification of the information is effected, we can verify several assertions made earlier. First, it is apparent that after an intervention is carried out, no points in the model for which the value of Xi is not the chosen x′i get to have positive probability. A point in the model, i.e.
an n-tuple of values (x1, ..., xn), can only have positive probability if it agrees with the intervention about what value Xi is to take. Second, examining the term Pr(x1, ..., xn) / Pr(x′i ∣ pai), we can see that every point that agrees on the values of the parents of Xi will be modified in weight of probability mass by the same factor, which is the inverse of the conditional probability Pr(x′i ∣ pai). The relative weights of these points will not be changed. Third, the parents of Xi do not have their probability changed in the intervention transformation. The second and third points entail that the probability mass moved by an intervention is moved inwardly, from an excluded point to a point which shares the value it gives to the parents of Xi. Intervening to force the probability of a variable keeps it in the family (as it were). This is in stark contrast with Bayesian conditioning, which moves probability around, redistributing it to every preserved point by way of a normalization constant. For a comparison, we could write the Bayesian condition with explicit "see" notation (and V referring to all variables other than Xi) as follows:

Pr(x1, ..., xn ∣ see(x′i)) =
  Pr(x1, ..., xn) / ∑_v⃗ Pr(v⃗, x′i)   if xi = x′i,
  0                                    otherwise.   (A.13)

Thus what differs semantically between Bayesian conditioning and intervening is how the transfer of probability mass is effected, given that both operations work by the exclusion of certain points from the model.

The simple notion of an atomic intervention characterized above can be, and has been, extended in two ways. First, to complex interventions on several variables at once. Second, the notion has been generalized to apply to an intervention effected with a probability less than one.
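The contrast between (A.12) and (A.13) can be checked directly on a joint table. The sketch below assumes a simple chain DAG X → Y → Z (so pa(Y) = {X}) with arbitrary illustrative numbers; it applies both transformations for Y = y∗ and confirms that the parent X keeps its marginal under do but not under see.

```python
from itertools import product

# A joint Pr(X, Y, Z) Markov relative to the chain X -> Y -> Z,
# all variables binary; pa(Y) = {X}. Numbers are illustrative only.
px = {0: 0.6, 1: 0.4}
py_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # Pr(y | x)
pz_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # Pr(z | y)
joint = {(x, y, z): px[x] * py_x[x][y] * pz_y[y][z]
         for x, y, z in product([0, 1], repeat=3)}

y_star = 1

# (A.12): do(Y = y*) -- surviving points divided by Pr(y* | pa(Y)) = Pr(y* | x).
do_dist = {k: (p / py_x[k[0]][y_star] if k[1] == y_star else 0.0)
           for k, p in joint.items()}

# (A.13): see(Y = y*) -- surviving points divided by the normalization Pr(y*).
pr_ystar = sum(p for k, p in joint.items() if k[1] == y_star)
see_dist = {k: (p / pr_ystar if k[1] == y_star else 0.0)
            for k, p in joint.items()}

def marg_x1(dist):
    return sum(p for k, p in dist.items() if k[0] == 1)

# The parent X keeps its marginal under do, but not under see:
print(marg_x1(joint), marg_x1(do_dist), marg_x1(see_dist))
```

Because each surviving point is divided by a factor depending only on the parent values, the do-transformation moves mass "inwardly" within each parent configuration, exactly as claimed above; the see-transformation divides by a single global constant and so disturbs the parents' marginal.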
Given what we have said about the connections with Bayesian conditioning, we can see similarities in this respect as well, insofar as, for example, Jeffrey conditioning is likewise a generalization of the Bayesian rule to cases in which seeing does not yield an event of probability one.

A.2 Connection with other frameworks

Conditioning and intervening represent different operations in both probabilistic (stochastic) and deterministic (functional) models. But the fundamental distinction is echoed in a wide variety of formal settings. For example, in Lewisian systems for reasoning about counterfactuals, the distinction between conditioning and intervening corresponds to that between conditioning and imaging.81 In the AGM approach to qualitative axiomatic belief revision, it corresponds to the difference between belief revision and belief update.82 More recently, there have been attempts to apply the distinction between conditioning and intervening to systems in which the representation of uncertainty is not probabilistic but possibilistic.83 The applicability of the distinction to such a wide variety of formal contexts indicates that it is a distinction of fundamental validity and importance.

References: Pearl [62] and Spirtes et al. [76]. See also Lindley [48] for an excellent short overview.

81 Lewis [45].
82 See Gärdenfors [21]; Gärdenfors [20]; Katsuno and Mendelzon [40].
83 Benferhat [6].
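The Jeffrey-conditioning generalization mentioned above can also be sketched in a few lines. The partition {E = 0, E = 1}, the hypothesis A, and all the numbers below are hypothetical, chosen only to illustrate the rule Pr′(a) = Σe Pr(a ∣ e) q(e):

```python
# Assumed joint over (e, a): e indexes the evidence partition, a a hypothesis.
joint = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.3}

def jeffrey(joint, q_e):
    """Jeffrey conditioning: shift the marginal over the partition {e} to q_e
    while preserving the conditional probabilities Pr(a | e) within each cell."""
    marg = {e: sum(p for (ee, _), p in joint.items() if ee == e) for e in (0, 1)}
    return {(e, a): p * q_e[e] / marg[e] for (e, a), p in joint.items()}

# Uncertain evidence: seeing shifts Pr(E=1) from 0.4 to 0.8 without certainty.
post = jeffrey(joint, {0: 0.2, 1: 0.8})
print(round(post[(0, 1)] + post[(1, 1)], 6))   # Pr'(A=1) = 0.7

# Ordinary Bayesian conditioning on E=1 is the limiting case q_e(1) = 1.
bayes = jeffrey(joint, {0: 0.0, 1: 1.0})
```

Setting q_e to a point mass recovers ordinary conditioning, which is the sense in which Jeffrey's rule generalizes the Bayesian one, just as probabilistic interventions generalize the atomic do-operation.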
