Information Theory and Partial Belief Reasoning

by

Stefan Hermann Lukits

Master of Arts (M.A.), The University of British Columbia
Master of Science (Mag. rer. nat.), Karl-Franzens Universität Graz (University of Graz)

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate and Postdoctoral Studies
(Philosophy)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

May 2016

© Stefan Hermann Lukits 2016

Abstract

The dissertation investigates the nature of partial beliefs and norms governing their use. One widely accepted (though not uncontested) norm for partial belief change is Bayesian conditionalization. Information theory provides a far-reaching generalization of Bayesian conditionalization and gives it a foundation in an intuition that pays attention principally to information contained in probability distributions and information gained with new evidence.

This generalization has fallen out of favour with contemporary epistemologists. They prefer an eclectic approach which sometimes conflicts with norms based on information theory, particularly the entropy principles of information theory. The principle of maximum entropy mandates a rational agent to hold minimally informative partial beliefs given certain background constraints; the principle of minimum cross-entropy mandates a rational agent to update partial beliefs at minimal information gain consistent with the new evidence. The dissertation shows that information theory generalizes Bayesian norms and does not conflict with them.

It also shows that the norms of information theory can only be defended when the agent entertains sharp credences. Many contemporary Bayesians permit indeterminate credal states for rational agents, which is incompatible with the norms of information theory.
The dissertation then defends two claims: (1) the partial beliefs that a rational agent holds are formally expressed by sharp credences; and (2) when a rational agent updates these partial beliefs in the light of new evidence, the norms used are based on and in agreement with information theory.

In the dissertation, I defuse a collection of counter-examples that have been marshaled against entropy principles. More importantly, building on previous work by others and expanding it, I provide a coherent and comprehensive theory of the use of information theory in formal epistemology. Information theory rivals probability theory in formal virtue, theoretical substance, and coherence across intuitions and case studies. My dissertation demonstrates its significance in explaining the doxastic states of a rational agent and in providing the right kind of normativity for them.

Preface

This dissertation is original, unpublished, independent work by the author, with the exception of chapter 5 (published in Lukits, 2015) and chapter 7 (published in Lukits, 2014b).

Table of Contents

Abstract
Preface
Table of Contents
List of Figures
1 Introduction
  1.1 Information Theory and Probability Kinematics
  1.2 Counterexamples and Conceptual Problems
  1.3 Dissertation Outline
2 The Principle of Maximum Entropy
  2.1 Inductive Logic
  2.2 Information Theory and the Principle of Maximum Entropy
  2.3 Criticism
  2.4 Acceptance versus Probabilistic Belief
  2.5 Proposal
3 Changing Partial Beliefs Using Information Theory
  3.1 The Model
  3.2 Information Theory
    3.2.1 Shannon Entropy
    3.2.2 Kullback-Leibler Divergence
    3.2.3 Constraint Rule for Cross-Entropy
    3.2.4 Standard Conditioning
    3.2.5 Jeffrey Conditioning
  3.3 Pragmatic, Axiomatic, and Epistemic Justification
    3.3.1 Three Approaches
    3.3.2 The Psychological Approach
    3.3.3 The Formal Approach
    3.3.4 The Philosophical Approach
4 Asymmetry and the Geometry of Reason
  4.1 Contours of a Problem
  4.2 Epistemic Utility and the Geometry of Reason
    4.2.1 Epistemic Utility for Partial Beliefs
    4.2.2 Axioms for Epistemic Utility
    4.2.3 Expectations for Jeffrey-Type Updating Scenarios
  4.3 Geometry of Reason versus Information Theory
    4.3.1 Evaluating Partial Beliefs in Light of Others
    4.3.2 LP conditioning and Jeffrey Conditioning
    4.3.3 Triangulating LP and Jeffrey Conditioning
  4.4 Expectations for the Geometry of Reason
    4.4.1 Continuity
    4.4.2 Regularity
    4.4.3 Levinstein
    4.4.4 Invariance
    4.4.5 Expansibility
    4.4.6 Horizon
    4.4.7 Confirmation
  4.5 Expectations for Information Theory
    4.5.1 Triangularity
    4.5.2 Collinear Horizon
    4.5.3 Transitivity of Asymmetry
5 A Natural Generalization of Jeffrey Conditioning
  5.1 Two Generalizations
  5.2 Wagner’s Natural Generalization of Jeffrey Conditioning
  5.3 Wagner’s PME Solution
  5.4 Maximum Entropy and Probability Kinematics Constrained by Conditionals
6 A Problem for Indeterminate Credal States
  6.1 Booleans and Laplaceans
  6.2 Partial Beliefs
  6.3 Two Camps: Bool-A and Bool-B
  6.4 Dilation and Learning
    6.4.1 Dilation
    6.4.2 Learning
  6.5 Augustin’s Concessions
    6.5.1 Augustin’s Concession (AC1)
    6.5.2 Augustin’s Concession (AC2)
  6.6 The Double Task
7 Judy Benjamin
  7.1 Goldie Hawn and a Test Case for Full Employment
  7.2 Two Intuitions
  7.3 Epistemic Entrenchment
  7.4 Coarsening at Random
  7.5 The Powerset Approach
8 Conclusion
Bibliography

Appendices
A Weak Convexity and Symmetry in Information Geometry
B Asymmetry in Two Dimensions
C The Horizon Requirement Formalized
D The Hermitian Form Model
E The Powerset Approach Formalized

List of Figures

3.1 The statistical model of the 2-dimensional simplex S^2. y and z are the parameters, whereas x is fully determined by y and z. x, y, z represent the probabilities P(X_1), P(X_2), P(X_3) on an outcome space Ω = {ω_1, ω_2, ω_3}, where X_i is the proposition corresponding to the atomic outcome ω_i. The point (y, z) corresponds to the numbers in example 11.

3.2 Illustration of an affine constraint. This is a Jeffrey-type updating scenario. While x = 1/3 as in figure 3.1, the updated x′ = 1/2. These are again the numbers of example 11. There appear to be two plausible ways to update (y, z). One way, which we will call Jeffrey conditioning, follows the line to the origin of the coordinate system.
It accords with our intuition that as x′ → 1, the updated (y′_1, z′_1) is continuous with standard conditioning. The other way, which in chapter 4 I will call LP conditioning, uses the shortest geometric distance from (y, z) to determine (y′_2, z′_2). This case is illustrated in three dimensions (not using statistical parameters) in figure 4.4. One of the main points of this dissertation is that the ambivalence between updating methods can be resolved by using cross-entropy rather than Euclidean distance. Jeffrey conditioning gives us the updated probability distribution which minimally diverges from the relatively prior probability distribution in terms of cross-entropy.

4.1 The simplex S^2 in three-dimensional space R^3 with contour lines corresponding to the geometry of reason around point A in equation (4.12). Points on the same contour line are equidistant from A with respect to the Euclidean metric. Compare the contour lines here to figure 4.2. Note that this diagram and all the following diagrams are frontal views of the simplex.

4.2 The simplex S^2 with contour lines corresponding to information theory around point A in equation (4.12). Points on the same contour line are equidistant from A with respect to the Kullback-Leibler divergence. The contrast to figure 4.1 will become clear in much more detail in the body of the chapter. Note that the contour lines of the geometry of reason are insensitive to the boundaries of the simplex, while the contour lines of information theory reflect them. One of the main arguments in this chapter is that information theory respects epistemic intuitions we have about asymmetry: proximity to extreme beliefs with very high or very low probability influences the topology that is at the basis of updating.

4.3 An illustration of conditions (i)–(iii) for collinear horizon in List B. p, p′ and q, q′ must be equidistant and collinear with m and ξ. If q, q′ is more peripheral than p, p′, then collinear horizon requires that |d(p, p′)| < |d(q, q′)|.

4.4 The simplex S^2 in three-dimensional space R^3 with points a, b, c as in equation (4.12) representing probability distributions A, B, C. Note that geometrically speaking C is closer to A than B is. Using the Kullback-Leibler divergence, however, B is closer to A than C is. The same case modeled with statistical parameters is illustrated in figures 3.1 and 3.2.

4.5 The zero-sum line between A and C is the boundary line between the green area, where the triangle inequality holds, and the red area, where the triangle inequality is violated. The posterior probability distribution B recommended by Jeffrey conditioning always lies on the zero-sum line between the prior A and the LP posterior C, as per equation (4.23). E is the point in the red area where the triangle inequality is most efficiently violated.

4.6 Illustration for the six degree of confirmation candidates plus Carnap’s firmness confirmation and the Kullback-Leibler divergence. The top row, from left to right, illustrates FMRJ, the bottom row LGZI. ‘F’ stands for Carnap’s firmness confirmation measure F_P(x, y) = log(y/(1 − y)). ‘M’ stands for candidate (i), M_P(x, y) in (4.44); the other letters correspond to the other candidates (ii)-(v). ‘I’ stands for the Kullback-Leibler divergence multiplied by the sign function of y − x to mimic a quantitative measure of confirmation. For all the squares, the colour reflects the degree of confirmation with x on the x-axis and y on the y-axis, all between 0 and 1. The origin of the square’s coordinate system is in the bottom left corner. Blue signifies strong confirmation, red signifies strong disconfirmation, and green signifies the scale between them. Perfect green is x = y. G_P looks like it might pass the horizon requirement, but the derivative reveals that it fails confirmation horizon (see appendix C).

4.7 These two diagrams illustrate inequalities (4.56) and (4.57). The former displays all points in red which violate collinear horizon, measured from the centre. The latter displays points in different colours whose orientation of asymmetry differs, measured from the centre. The two red sets are not the same, but there appears to be a relationship, one that ultimately I suspect to be due to the more basic property of asymmetry.

7.1 Information that leads to unique solutions for probability updating using pme must come in the form of an affine constraint (a constraint in the form of an informationally closed and convex space of probability distributions consistent with the information). All information that can be processed by Jeffrey conditioning comes in the form of an affine constraint, and all information that can be processed by standard conditioning can also be processed by Jeffrey conditioning. The solutions of pme are consistent with the solutions of Jeffrey conditioning and standard conditioning where the latter two are applicable.

7.2 Judy Benjamin’s map. Blue territory (A_3) is friendly and does not need to be divided into a Headquarters and a Second Company area.

7.3 Judy Benjamin’s updated probability assignment according to intuition T1. 0 < ϑ < 1 forms the horizontal axis, the vertical axis shows the updated probability distribution (or the normalized odds vector) (q_1, q_2, q_3). The vertical line at ϑ = 0.75 shows the specific updated probability distribution G_ind(0.75) for the Judy Benjamin problem.

7.4 Judy Benjamin’s updated probability assignment using pme. 0 < ϑ < 1 forms the horizontal axis, the vertical axis shows the updated probability distribution (or the normalized odds vector) (q_1, q_2, q_3). The vertical line at ϑ = 0.75 shows the specific updated probability distribution G_max(0.75) for the Judy Benjamin problem.

7.5 This choice of rectangles is not a candidate because the number of rectangles in A_2 is not a t-multiple of the number of rectangles in A_1, here with s = 2, t = 3 as in scenario III.

7.6 This choice of rectangles is a candidate because the number of rectangles in A_2 is a t-multiple of the number of rectangles in A_1, here with s = 2, t = 3 as in scenario III.

7.7 Judy Benjamin’s updated probability assignment according to the powerset approach. 0 < ϑ < 1 forms the horizontal axis, the vertical axis shows the updated probability distribution (or the normalized odds vector) (q_1, q_2, q_3). The vertical line at ϑ = 0.75 shows the specific updated probability distribution G_pws for the Judy Benjamin problem.

B.1 The partition (4.59) based on different values for P. From top left to bottom right, P = (0.4, 0.4, 0.2); P = (0.242, 0.604, 0.154); P = (1/3, 1/3, 1/3); P = (0.741, 0.087, 0.172). Note that for the geometry of reason, the diagrams are trivial. The challenge for information theory is to explain the non-triviality of these diagrams epistemically without begging the question.
Chapter 1

Introduction

1.1 Information Theory and Probability Kinematics

This dissertation defends two claims: (1) the partial beliefs that a rational agent entertains are formally expressed by sharp credences; and (2) when a rational agent updates these partial beliefs in the light of new evidence, the norms used are based on and in agreement with information theory.

A Bayesian framework for partial beliefs is assumed throughout. I will not explicitly argue for this framework, although the overall integrity of my claims hopefully serves to support the Bayesian approach to epistemology, especially the rules governing belief change. Some say that the Bayesian approach conflicts with information theory. I will show that there is no such conflict. The account that I defend is a specific version of Bayesian epistemology. The major task of this dissertation is therefore to convince a Bayesian that a consistent application of the principles and intuitions leading to the Bayesian approach will lead to the acceptance of norms which are based on or in agreement with information theory.

If an agent’s belief state is representable by a probability distribution (or density), then Bayesians advocate formal and normative guidelines for changing it in the light of new evidence. Richard Jeffrey calls the investigation of these guidelines ‘probability kinematics’ (see Jeffrey, 1965). Standard conditioning is the updating procedure considered by Bayesians to be objective and generally valid. The | operator (probabilistic conditioning) provides a unique posterior probability distribution which accords with both basic probability axioms and intuitions about desired properties of updated probabilities.
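As a minimal sketch of standard conditioning on a finite outcome space, the following Python fragment sets the probability of every outcome ruled out by the evidence to zero and rescales the remaining outcomes by the prior probability of the evidence. The function name `condition`, the outcome labels, and the prior values are illustrative assumptions and do not come from the dissertation.

```python
from fractions import Fraction

def condition(prior, event):
    """Return the posterior P(. | event) for a finite prior given as {outcome: probability}."""
    p_event = sum(p for outcome, p in prior.items() if outcome in event)
    if p_event == 0:
        raise ValueError("cannot condition on an event with prior probability zero")
    # Outcomes outside the event drop to zero; the rest are rescaled by P(event).
    return {outcome: (p / p_event if outcome in event else Fraction(0))
            for outcome, p in prior.items()}

# Hypothetical prior over three atomic outcomes.
prior = {"w1": Fraction(1, 2), "w2": Fraction(1, 3), "w3": Fraction(1, 6)}
# Evidence: the true outcome is w2 or w3.
posterior = condition(prior, {"w2", "w3"})
print(posterior)
```

In this sketch the ratio between the surviving outcomes is the same before and after the update, which is one of the intuitions about updated probabilities that standard conditioning respects.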
Although almost all the claims, examples, and formal methods presented in the following pertain to discrete and finite probability distributions, there are usually equivalent things to be said for distributions over infinite event spaces and for continuous probability densities. Sometimes this requires some mathematical manoeuvring (as is the case for the Shannon entropy), sometimes it requires no more than substitution of an integral sign for a sum sign. I will exclusively refer to finite probability distributions, but many of the claims can be extended to suit the continuous or infinite case.

Bayesian probability kinematics requires a prior probability distribution to precede any meaningful evaluation of evidence and considers standard conditioning to be mandatory for a rational agent when she forms her posterior probabilities, just in case her evidence is expressible in a form which makes standard conditioning an option. There are various situations, such as the Judy Benjamin problem or the Brandeis Dice problem (see example 8), in which standard conditioning does not appear to be an option (although some make the case that it is, so this point needs to be established independently). Therefore the question arises whether there is room for a more general updating method, whose justification will entail a justification of standard conditioning, but not be entailed by it.

E.T. Jaynes has suggested a unified updating procedure which generalizes standard conditioning, called the principle of maximum entropy, or pme for short.
In addition to Bayesian probability kinematics, pme uses information theory to develop updating methods which keep the entropy of probability distributions high (in the synchronic case) and their cross-entropy low (in the diachronic case).

pme is based on a subjective interpretation of probabilities, where probabilities represent the degree of uncertainty, or the lack of information, of the agent holding these probabilities. In this interpretation, probabilities do not represent frequencies or objective probabilities, although there are models of how subjective probabilities relate to them. pme enjoys a wide range of applications in science (for summaries see Shore and Johnson, 1980, 26; Buck and Macaulay, 1991; Karmeshu, 2003; Debbah and Müller, 2005, 1668; and Klir, 2006, 95). Most Bayesians reject the notion that pme enjoys the same level of justification as standard conditioning and maintain that there are cases in which it delivers results which a rational agent should not, or is not required to, accept.

Note that I distinguish between two separate problems: on the one hand, there may be objective methods of determining probabilities prior to any evidence or observation (call these ‘absolutely’ prior probabilities), for example from some type of principle of indifference; on the other hand, there may be objective methods of determining posterior probabilities from given prior probabilities (which themselves could be posterior probabilities in a previous instance of updating; call these ‘relatively’ prior probabilities) in case standard conditioning does not apply.

Another way to distinguish between absolutely prior distributions and relatively prior distributions is to use Arnold Zellner’s ‘antedata’ and ‘postdata’ nomenclature (see Zellner, 1988).
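The synchronic and diachronic uses of pme described above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the dissertation's own formalism. The synchronic part uses the mean constraint of 4.5 from Jaynes's usual statement of the Brandeis Dice problem, which this section mentions but does not spell out. The diachronic part uses a hypothetical three-outcome prior and a hypothetical rival posterior. All function names are illustrative.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (in nats) of a finite distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(q, p):
    """Kullback-Leibler divergence D(q || p) in nats; assumes p[i] > 0 wherever q[i] > 0."""
    return sum(b * math.log(b / a) for b, a in zip(q, p) if b > 0)

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6)):
    """Maximum-entropy distribution over the faces subject to a prescribed mean.

    The Lagrange-multiplier solution has the form p_i proportional to exp(lam * i).
    The mean increases with lam, so lam is found by bisection.
    The target mean must lie strictly between the smallest and largest face.
    """
    def dist(lam):
        weights = [math.exp(lam * f) for f in faces]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(f * p for f, p in zip(faces, dist(lam)))

    lo, hi = -20.0, 20.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

def jeffrey_update(prior, partition, new_weights):
    """Jeffrey conditioning: give each partition cell its new weight and
    keep the prior proportions inside each cell."""
    posterior = [0.0] * len(prior)
    for cell, weight in zip(partition, new_weights):
        cell_mass = sum(prior[i] for i in cell)
        for i in cell:
            posterior[i] = weight * prior[i] / cell_mass
    return posterior

# Synchronic case: the least informative die with mean 4.5.
die = maxent_die(4.5)
print([round(p, 4) for p in die])

# Diachronic case: a hypothetical prior whose first outcome moves from 1/3 to 1/2.
prior = [1 / 3, 1 / 2, 1 / 6]
jeffrey = jeffrey_update(prior, [[0], [1, 2]], [0.5, 0.5])
rival = [0.5, 0.25, 0.25]  # hypothetical alternative meeting the same constraint
print(kl_divergence(jeffrey, prior) < kl_divergence(rival, prior))
```

The last comparison illustrates, for one rival only, the sense in which the Jeffrey posterior stays closer to the relatively prior distribution in terms of cross-entropy than another distribution satisfying the same constraint; it is a spot check, not a proof.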
My work is concerned with a logic of belief change, not with objectivism or convergence, although pme has been used to defend objectivism, notably by Jaynes himself (more about this in section 2.1).

Formal epistemologists widely concur that pme is beset with too many conceptual problems and counterexamples to yield a generally valid objective updating procedure. Against the tide of this agreement, I am mounting a defence of pme as a viable candidate for a generally valid, objective updating procedure. This defence also identifies problems with pme, especially in chapter 4, but I consider these problems to be surmountable. pme is unmatched by other updating procedures in its wide applicability, integration with other disciplines, and formal power (see chapter 4 for a comparison with the geometry of reason). Giving up on it means giving up on generality overall and settling for piecemeal approaches and a patchwork of ad hoc solutions to cases.

Many of the portrayals of pme’s failings are flawed and motivated by a desire to demonstrate that the labour of the epistemologist in interpreting probability kinematics on a case-by-case basis is indispensable. This ‘full employment theorem’ of probability kinematics is widely promulgated by textbooks and passed on to students as expert consensus. There are formally proven equivalents for the full employment theorem in computer science. Here is a naive characterization of what full employment theorems are able to show in computer science:

Full Employment Theorem. No algorithm can optimally perform a particular task done by some class of professionals.

The theorem ensures that a variety of tasks in computer science is performed by programs whose optimization is in principle an open question. There cannot be a proof that a particular program executes a task optimally. The theorem secures employment for computer programmers, for example in developing search engines, spam filters, or virus detection.
Rudolf Carnap discusses the “impossibility of an automatic inductive procedure” (see Carnap, 1962, 192–199), making explicit reference to Alan Turing, who is behind the full employment theorem in computer science.

From now on, I will no longer make reference to the full employment theorems in computer science, which only provide us with a suggestive metaphor for what is at stake in epistemology. The full employment theorem in probability kinematics is the position that intuitions about particular cases are necessary in order to find epistemically justified solutions to certain updating problems for partial beliefs. These intuitions cannot be contained by general updating procedures.

Teddy Seidenfeld makes the case that the appeal of pme is for particular cases with a view to pragmatic advantages, but that the general theory needs to recognize the problems and keep a watchful eye on which formalism produces the best results in specific cases:

“A pragmatic appeal to successful applications of the MAXENT formalism cannot be dismissed lightly. The objections that I raise in this paper are general. Whether (and if so, how) the researchers who apply MAXENT avoid these difficulties remains an open question. Perhaps, by appealing to extra, discipline-specific assumptions they find ways to resolve conflicts within MAXENT theory. A case-by-case examination is called for.” (Seidenfeld, 1986, 262)

A similar sentiment is expressed in Grove and Halpern, 1997, 210:

“Against this, a rule like cross-entropy seems extremely attractive. It provides a single general recipe which can be mechanically applied to a huge space of updates [. . . ] since all we do in this paper is analyze one particular problem, we must be careful in making general statements on the basis of our results. Nevertheless, they do seem to support the claim that sometimes, the only ‘right’ way to update, especially in an incompletely specified situation, is to think very carefully about the real nature and origin of the information we receive, and then (try to) do whatever is necessary to find a suitable larger space in which we can condition. If this doesn’t lead to conclusive results, perhaps this is because we do not understand the information well enough to update with it. However much we might wish for one, a generally satisfactory mechanical rule such as cross-entropy, which saves us all this questioning and work, probably does not exist.”

This is echoed both in Halpern’s textbook Reasoning About Uncertainty, where Halpern emphasizes that “there is no escaping the need to understand the details of the application” (2003, 423), and by Richard Bradley, who determines that much “will depend on the judgmental skills of the agent, typically acquired not in the inductive logic class but by subject specific training” (Bradley, 2005, 349).

Grove and Halpern, in their article about the Judy Benjamin problem, admonish the objectivists in accordance with the full employment theorem that “one must always think carefully about precisely what the information means” (Grove and Halpern, 1997, 6), and “the only right way to update, especially in an incompletely specified situation, is to think very carefully about the real nature and origin of the information we receive” (Grove and Halpern, 1997, 3). In chapter 7, I will give a detailed description of the Judy Benjamin problem.

Igor Douven and Jan-Willem Romeijn advocate an innovative approach to the Judy Benjamin problem, in which epistemic entrenchment supplies a subjectivist perspective on these types of problem and delivers a Jeffrey partition so that Jeffrey’s rule can be applied instead of maxent.
They agree with Bradley that “even Bayes’ rule ‘should not be thought of as a universal and mechanical rule of updating, but as a technique to be applied in the right circumstances, as a tool in what Jeffrey terms the art of judgment.’ In the same way, determining and adapting the weights [epistemic entrenchment] supposes, or deciding when Adams conditioning applies, may be an art, or a skill, rather than a matter of calculation or derivation from more fundamental epistemic principles” (Douven and Romeijn, 2009, 16) (for the Bradley quote see Bradley, 2005, 362).

Resisting this trend, E.T. Jaynes derisively talks about the statistician-client relationship as one between a doctor and her patient. “The collection of all the logically unrelated medicines and treatments,” Jaynes explains, “that a sick patient might need [is] too large for anyone but a dedicated professional to learn” (Jaynes and Bretthorst, 2003, 492). To undercut this dependency, Jaynes suggests that

“the scientist who has learned the simple, unified principles of inference expounded here would not consult a statistician for advice because he can now work out the details of his specific data analysis problem for himself, and if necessary write a new computer program, in less time than it would take to read about them in a book or hire it done by a statistician.” (Jaynes and Bretthorst, 2003, 506)

Diaconis and Zabell warn “that any claims to the effect that maximum-entropy revision is the only correct route to probability revision should be viewed with considerable caution” (Diaconis and Zabell, 1982, 829).
“Great caution” (Howson and Franklin, 1994, 456) is also what Colin Howson and Allan Franklin advise about the claim that the posterior probabilities provided by maxent are as like the prior probabilities as it is possible to be, given the constraints imposed by the data.

The promise of pme is that it combines Jaynes’ simple, unified principle with a sophisticated formal theory to confirm that the principle reliably works. Where there is inconsistency between epistemic intuitions and information theory, there is hope for deeper explanations. There is also the undeniable appeal of eclecticism and doubts that one size will fit all. The proof of the pudding will be in the eating, and so I will have a detailed look at what the inconsistencies with epistemic intuition and their solutions may be.

1.2 Counterexamples and Conceptual Problems

Although the role of pme in probability kinematics as a whole will be the scope of my work, I will pay particular attention to a problem which has stymied its acceptance by epistemologists: updating or conditioning on conditionals. Two counterexamples, Bas van Fraassen’s Judy Benjamin and Carl Wagner’s Linguist problem, are specifically based on updating given an observation expressed as a conditional and on pme’s alleged failure to update on such observations in keeping with epistemic intuitions.

Example 1: Deadpool. My 14-year-old wants to watch the movie Deadpool. We, the parents, disapprove. He argues that his friend James may be allowed to watch the movie. James’ parents are friends with us. They have decided to watch the movie alone first in order to assess if it is appropriate for a 14-year-old. We ask them to share their assessment with us. Let Sherry represent us and Charles represent James’ parents. If Charles gives a positive report (movie is appropriate for 14-year-olds), then in my son’s assessment the movie is appropriate for 14-year-olds.
In Sherry’s assessment, however, the movie remains inappropriate and Charles’ reputation as assessor is compromised.

In this example, my son is epistemically entrenched that Charles is a trustworthy assessor; Sherry is epistemically entrenched that Deadpool is inappropriate for 14-year-olds. In terms of probabilities, what the agent learns from a specification of P(B|A) depends not only on the prior probabilities but also on the epistemic entrenchment of the agent. Sometimes, the agent is so committed to B that learning a high conditional probability dramatically lowers the probability of A but leaves the posterior for B unscathed. By contrast, another agent varies P(B) with what they learn about P(B|A) instead of P(A). More epistemically minded cases are given in examples 39 and 40 on page 170. Igor Douven and Jan-Willem Romeijn propose that Hellinger’s distance measure is the appropriate measure for epistemic entrenchment, so that an agent is not committed to falling on one or the other side of the extremes, but could also be epistemically leaning in one or the other direction (see Douven and Romeijn, 2009, 12ff).

Epistemic entrenchment is problematic for pme because pme has no mechanism for it. As will become clear in chapter 7, pme chooses a middling distribution between the extremes of epistemic entrenchment; this will be interpreted as problematic because in the Judy Benjamin case the intuition supposedly is one-sided with respect to epistemic entrenchment.
Epistemic entrenchment updates on conditionals by assuming a second tier of commitment to propositions beneath the primary tier of quantitative degrees of uncertainty of belief such as probabilities (or other ways of representing uncertainty, such as ranks).

pme conceptualizes this second tier differently and is consequently at odds with the voluminous recent literature on epistemic entrenchment (for example Skyrms, who considers higher order probabilities in Skyrms, 1986, 238; but also Hempel and Oppenheim, 1945, 114; Domotor, 1980; Earman, 1992, 52; Gillies, 2000, 21). This issue is related to a dilemma that I develop in chapter 6 for indeterminate credal states: they cannot perform the double task of representing both epistemic uncertainty and all relevant features of the evidence. The dilemma in this case, however, may be directed at pme: can it keep track of both the prior probabilities and of the epistemic entrenchments of the agent? If not, then there is a need for higher order probabilities or another way to record the entrenchments.

My line of argument emphasizes a strict dividing line between the epistemic and the evidential. Epistemic states, such as epistemic entrenchment, are not evidence and as such do not influence updating. At most, they influence how the relatively prior probabilities are generated. If the entrenchment is based on evidence, then that evidence has its usual bearing on the posterior probabilities. Higher-order probabilities are an attempt to locate more of the updating action in the realm of epistemic states. pme resists the shifting of weight from the evidential to the epistemic, just as it resists the full employment theorem.

A large part of my project is to address and defend pme’s performance with respect to conditionals, both conceptually and with a view to threatening counterexamples. One issue that comes to the forefront when addressing update on conditionals is whether a rational agent has sharp credences.
The rejection of indeterminate credal states used to be Bayesian dogma, but recent years have seen a substantial amount of work by Bayesians who defend and require indeterminate credal states for rational agents. I will show, using Wagner’s counterexample to the pme, that it is indeed inconsistent to embrace both the pme and an affirmative attitude towards indeterminate credal states for rational agents. Wagner’s counterexample only works because he implicitly assumes indeterminacy. For many contemporary Bayesians who accept indeterminacy Wagner’s argument is sound, even though it is enthymematic. A defence of pme must include an independent argument against indeterminacy, which I will provide in chapter 6.

In summary, my thesis is that pme is defensible against all counterexamples and conceptual issues raised so far as a generally valid objective updating method in probability kinematics. pme operates on the basis of a simple principle of ampliative reasoning (ampliative in the sense that the principle narrows the field of logical possibilities): when updating your probabilities, waste no useful information and do not gain information unless the evidence compels you to gain it (see Jaynes, 1988, 280; van Fraassen 1986, 376; Zellner, 1988, 278; and Klir, 2006, 356). The principle comes with its own formal apparatus (not unlike probability theory itself): Shannon’s information entropy, the Kullback-Leibler divergence, the use of Lagrange multipliers, and the sometimes intricate, sometimes straightforward relationship between information and probability.

Once the counterexamples are out of the way, the more serious conceptual issues loom. On the one hand, there are powerful conceptual arguments affirming the special status of pme. Shore and Johnston, who use the axiomatic strategy of Cox’s theorem in probability kinematics, show that relatively intuitive axioms only leave us with pme to the exclusion of all other objective updating methods.
Van Fraassen, R.I.G. Hughes, and Gilbert Harman’s MUD method, for example, or their maximum transition probability method from quantum mechanics both fulfill their five requirements (see van Fraassen et al., 1986), but do not fulfill Shore and Johnston’s axioms. Neither does Uffink’s more general class of inference rules, which maximize the so-called Rényi entropies, but Uffink argues that Shore and Johnston’s axioms rest on unreasonably strong assumptions (see Uffink, 1995). Caticha and Giffin counter that Skilling’s method of induction (see Skilling, 1988) and Jaynes’ empirical results in statistical mechanics and thermodynamics imply the uniqueness of Shannon’s information entropy over rival entropies.

pme seamlessly generalizes standard conditioning and Jeffrey’s rule where they are applicable (see Caticha and Giffin, 2006). It underlies the entropy concentration phenomenon described in Jaynes’ standard work Probability Theory: The Logic of Science, which also contains a sustained conceptual defence of pme and its underlying logical interpretation of probabilities. Entropy concentration refers to the unique property of the pme solution to have other distributions which obey the affine constraint cluster around it.
When used to make predictions whose quality is measured by a logarithmic score function, posterior probabilities provided by pme result in minimax optimal decisions (see Topsøe, 1979; Walley, 1991; Grünwald, 2000) so that by a logarithmic scoring rule these posterior probabilities are in some sense optimal.

Jeff Paris has investigated different belief functions (probabilities, Dempster-Shafer, and truth-functional, see Paris, 2006) from a mathematical perspective and come to the conclusion that given certain assumptions about the constraints that experience normally imposes (I will have to examine their relationship to the affine constraints assumed by pme), if a belief function is a probability function, only minimum cross entropy belief revision satisfies a host of desiderata (continuity, equivalence, irrelevant information, open-mindedness, renaming, obstinacy, relativization, and independence) while competitors fail on multiple counts (see Paris and Vencovská, 1990).

Belief revision literature, however, has in the last twenty years turned its attention to the AGM paradigm (named after Carlos Alchourrón, Peter Gärdenfors, and David Makinson), which operates on the basis of fallible beliefs and their logical relationships. There are two different epistemic dimensions, to use Henry Kyburg’s expression: one where doxastic states are cashed out in terms of fallible beliefs which move in and out of belief sets; and another where ‘beliefs’ are reflections of a more deeply rooted, graded notion of uncertainty.

Jeffrey with his radical probabilism pursues a project of epistemological monism (see Jeffrey, 1965) which would reduce beliefs to probabilities, while Spohn and Maher seek reconciliation between the two dimensions, showing how fallible full beliefs are epistemologically necessary and how the formal structure of the two dimensions reveals many shared features so that in the end they have more in common than what separates them (see Spohn, 2012, 201 and Maher, 1993).

I am sympathetic to Spohn’s and Maher’s projects. pme can contribute to them in the sense of clarifying the dividing line between epistemic states and evidential input. The belief revision literature, the higher order probabilities approach, and the investigation of epistemic entrenchment all emphasize the additional influence that epistemic states have in the updating process over and above the evidence. Information theory does not lend itself to these considerations and tends to give recommendations strictly based on evidential constraints, minimizing information gain. It may be helpful to have this clarity in partial belief epistemology before embarking on a reconciliation project with full belief epistemology.

There is a ‘docta ignorantia’ paradox in partial belief epistemology. While using partial beliefs rather than full beliefs at first suggests greater modesty, partial beliefs present themselves as containing more information than full beliefs. It appears that a person with a full belief in A and ¬C and a non-commitment to either B or ¬B is informationally more modest than a partial believer who has credences of Cr(A) = 0.86, Cr(C) = 0.12, Cr(B) = 0.47. Greater epistemic modesty translates into informational immodesty of doxastic states. Some Bayesians seek to remedy this paradox by introducing indeterminate credal states. I will show in chapter 5 that this strategy is inconsistent with pme. The problem with epistemic modesty expressed in imprecision is that it assumes that representation of uncertainty is just another belief. If I have a sharp credence, I have one particular belief; if I accept imprecision, I am non-committal about a range of different beliefs, which on the face of it sounds like epistemic modesty.
The semantics of uncertainty, however, is such that representing uncertainty is not another belief. Information theory will help to be more clear about the representation of epistemic states and the role of epistemic states versus evidential input.

1.3 Dissertation Outline

Despite the potholes in the historical development of the pme, on account of its unifying features, its simple and intuitive foundations, and its formal success it deserves more attention in the field of belief revision and probability kinematics. pme is the single principle which can hold things together over vast stretches of epistemological terrain (intuitions, formal consistency, axiomatization, case management) and calls into question the scholarly consensus that such a principle is not needed.

Making this case, I proceed as follows. The first three chapters are introductory: preface; introduction; and a formal introduction to pme. In chapter 4, I address the geometry of reason used by defenders of the epistemic utility approach to Bayesian epistemology to justify the foundational Bayesian tenets of probabilism and standard conditioning. The geometry of reason is the view that the underlying topology for credence functions is a metric space, on the basis of which axioms and theorems of epistemic utility for partial beliefs are formulated. It implies that Jeffrey conditioning, which is implied by pme, must cede to an alternative form of conditioning. This alternative form of conditioning fails a long list of plausible expectations and implies unacceptable results in certain cases.

One solution to this problem is to reject the geometry of reason and accept information theory in its stead. Information theory comes fully equipped with an axiomatic approach which covers probabilism, standard conditioning, and Jeffrey conditioning. It is not based on an underlying topology of a metric space, but uses non-commutative divergences instead of a symmetric distance measure.
I show that information theory, despite initial promise, also fails to accommodate basic epistemic intuitions. The chapter, by bringing the alternative of information theory to the forefront, therefore does not clean up the mess but rather provides its initial cartography.

With information theory established as a candidate with both promise and areas of concern, but superior to the geometry of reason, chapters 5 and 6 address the question of indeterminate credal states. I begin with a problem that Wagner has identified for pme, the Linguist problem. Chapter 5 identifies the enthymematic nature of the problem: it is not mentioned that one of its premises is the acceptance of indeterminate credal states. Wagner’s reasoning is valid, therefore one must either accept the conclusion and reject pme, or reject indeterminate credal states. I dedicate one whole chapter, chapter 6, to an independent defence of mandatory sharp credences for rational agents, thereby rejecting indeterminate credal states.

In chapter 5, the story goes as follows: when we come to know a conditional, we cannot straightforwardly apply Jeffrey conditioning to gain an updated probability distribution. Carl Wagner has proposed a natural generalization of Jeffrey conditioning to accommodate this case (Wagner conditioning). The generalization rests on an ad hoc but plausible intuition (W), leaving ratios of partial beliefs alone in updating when they are not affected by evidential constraints. Wagner shows how pme disagrees with intuition (W) and therefore considers pme to be undermined. Chapter 5 presents a natural generalization of Wagner conditioning which is derived from pme and implied by it. pme is therefore not only consistent with (W), it seamlessly and elegantly generalizes it (just as it generalizes standard conditioning and Jeffrey conditioning).

Wagner’s inconsistency result for (W) and pme is instructive.
It rests on the assumption (I) that the credences of a rational agent may be indeterminate. While many Bayesians now hold (I), it is difficult to articulate pme on its basis because to date there is no satisfactory proposal for how to measure indeterminate probability distributions in terms of information theory (see the latter part of section 6.2). Most, if not all, advocates of pme resist (I). If they did not, they would be vulnerable to Wagner’s inconsistency result. Wagner has therefore not, as he believes, undermined pme but only demonstrated that advocates of pme must accept that rational agents have sharp credences. Majerník shows that pme provides unique and plausible marginal probabilities, given conditional probabilities. The obverse problem posed in chapter 5 is whether pme also provides such conditional probabilities, given certain marginal probabilities. The theorem developed to solve the obverse Majerník problem demonstrates that in the special case introduced by Wagner pme does not contradict (W) (which subsequently I also call Jeffrey’s Updating Principle or jup), but elegantly generalizes it and offers a more integrated approach to probability updating.

Chapter 5 has the effect of welding together pme and a rejection of indeterminacy for credal states. Bayesians who permit indeterminate credal states will have trouble accommodating Wagner’s result, while those defending pme find themselves in the position of having to defend sharp credences as well. Chapter 6 makes the case that there is integrity in a joint defence of pme and sharp credences. Many Bayesian epistemologists now accept that it is not necessary for a rational agent to hold sharp credences. There are various compelling formal theories of how such a non-traditional view of credences can accommodate decision making and updating. They are motivated by a common complaint: that sharp credences can fail to represent incomplete evidence and exaggerate the information contained in it.
Indeterminate credal states, the alternative to sharp credences, face challenges as well: they are vulnerable to dilation and under certain conditions do not permit learning. Chapter 6 focuses on two concessions that Thomas Augustin and James Joyce make to address these challenges. The concessions undermine the original case for indeterminate credal states. I use both conceptual arguments and hands-on examples to argue that rational agents always have sharp credences.

In chapter 7, I address a counterexample that opponents of objective methods to determine updated probabilities prominently cite to undermine the generality of pme. This problem, van Fraassen’s Judy Benjamin case, is frequently used as the knock-down argument against pme. Chapter 7 shows that an intuitive approach to Judy Benjamin’s case supports pme. This is surprising because based on independence assumptions the anticipated result is that it would support the opponents. It also demonstrates that opponents improperly apply independence assumptions to the problem. Chapter 7 addresses the issue of epistemic entrenchment at length.

Chapter 2

The Principle of Maximum Entropy

2.1 Inductive Logic

David Hume poses one of the fundamental questions for the philosophy of science, the problem of induction. There is no deductive justification that induction works, as the observations which serve as a basis for inductive inference are not sufficient to make an argument for the inductive conclusion deductively valid. An inductive justification would beg the question. The late 19th and the 20th century bring us two responses to the problem of induction relevant to my project: (i) Bayesian epistemology and the subjective interpretation of probability focuses attention on uncertainty and beliefs of agents, rather than measuring frequencies or hypothesizing about objective probabilities in the world (John Maynard Keynes, Harold Jeffreys, Bruno de Finetti, Frank Ramsey; against, for example, R.A. Fisher and Karl Popper).
(ii) Philosophers of science argue that some difficult-to-nail-down principle (indifference, simplicity, laziness, symmetry, entropy) justifies entertaining certain hypotheses more seriously than others, even though more than one of them may be compatible with experience (Ernst Mach, Carnap).

Two pioneers of Bayesian epistemology and subjectivism are Harold Jeffreys (see Jeffreys, 1931, and Jeffreys, 1939) and Bruno de Finetti (see de Finetti, 1931 and 1937). They personify a divide in the camp of subjectivists about probabilities. While de Finetti insists that any probability distribution could be rational for an agent to hold as long as it obeys the axioms of probability theory, Jeffreys considers probability theory to be an inductive logic with rules, resembling the rules of deductive logic, about the choice of prior and posterior probabilities. While both agree on subjectivism in the sense that probabilities reflect an agent’s uncertainty (or, in Jeffreys’ case, more properly a lack of information), they disagree on the subjectivist versus objectivist interpretation of how these probabilities are chosen by a rational agent (or, in Jeffreys’ case, more properly by the rules of an inductive logic, as even maximally rational agents may not be able to implement them). The logical interpretation of probabilities begins with John Maynard Keynes (see Keynes, 1921), but soon turns into a fringe position with Harold Jeffreys (for example Jeffreys, 1931) and E.T. Jaynes (for example Jaynes and Bretthorst, 2003) as advocates who are standardly invoked for refutation.
Carnap develops his own brand of a logical interpretation, which in some ways is analogous to his earlier conventionalism in geometry. For Carnap, there is a formal structure for partial beliefs and updating procedures, but within the formal structure there are flexibilities which allow the system to be adapted to the use to which it is put (see Carnap, 1952).

The problem in part is that the logical interpretation cannot get off the ground with plausible rules about how to choose absolutely prior probabilities. No one is able to overcome the problem of transformation invariance for the principle of indifference (consider Bertrand’s paradox, see Paris, 2006, 71f), not even E.T. Jaynes (for his attempts see Jaynes, 1973; for a critical response see Howson and Urbach, 2006, 285 and Gillies, 2000, 48).

One especially intractable problem for the principle of indifference is Ludwig von Mises’ water/wine paradox:

Example 2: The Water/Wine Paradox. There is a certain quantity of liquid. All that we know about the liquid is that it is composed entirely of wine and water and that the ratio of wine to water is between 1/3 and 3. What is the probability that the ratio of wine to water is less than or equal to 2?

There are two ways to answer this question which are inconsistent with each other (see von Mises, 1964, 161). According to van Fraassen, example 2 shows why we should “regard it as clearly settled now that probability is not uniquely assignable on the basis of a principle of indifference” (van Fraassen, 1989, 292). Van Fraassen goes on to claim that the paradox signals the ultimate defeat of the principle of indifference, nullifying the efforts undertaken by Poincaré and Jaynes in solving other Bertrand paradoxes (see Mikkelson, 2004, 137; and Gyenis and Rédei, 2015). Donald Gillies calls von Mises’ paradox a “severe, perhaps in itself fatal, blow” to Keynes’ logical theory of probability (see Gillies, 2000, 43).
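
The two inconsistent answers to example 2 are not spelled out in the passage above. The following is a sketch of the usual reconstruction, not a quotation from von Mises; the symbols x and y are introduced here for illustration only. The principle of indifference is applied once to the ratio of wine to water and once to the ratio of water to wine, both of which are known only to lie between 1/3 and 3.

```latex
% x = ratio of wine to water, y = 1/x = ratio of water to wine (illustrative symbols).
% Both x and y lie in the interval [1/3, 3].
\begin{align*}
  \text{uniform density over } x:\quad
    P(x \le 2) &= \frac{2 - \tfrac{1}{3}}{3 - \tfrac{1}{3}} = \frac{5}{8}, \\[4pt]
  \text{uniform density over } y:\quad
    P(x \le 2) = P\!\left(y \ge \tfrac{1}{2}\right)
      &= \frac{3 - \tfrac{1}{2}}{3 - \tfrac{1}{3}} = \frac{15}{16}.
\end{align*}
```

On this reconstruction the description of the liquid gives no reason to prefer one ratio over the other, yet the same event receives probability 5/8 under one parametrization and 15/16 under the other.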
De Finetti’s subjectivism is an elegant solution to this problem and marginalizes the logical theory.

While Jaynes throws up his hands over von Mises’ paradox, despite the success he lands addressing Bertrand’s paradox (see Jaynes, 1973), Jeffrey Mikkelson has recently suggested a promising solution to von Mises’ paradox (see Mikkelson, 2004). There may still be hope for an objectivist approach to absolutely prior probabilities. Nevertheless, my dissertation remains agnostic about this problem. The domain of my project is probability kinematics. Relatively prior probabilities are assumed, and their priority only refers to the fact that they are prior to the posterior probabilities (one may call these distributions or densities antedata and postdata rather than prior and posterior, to avoid confusion) and not necessarily prior to earlier evidence.

This raises a conceptual problem: why would anybody be interested in a defence of objectivism in probability kinematics when the sense is that objectivism has failed about absolutely prior probabilities? My intuition is in line with Keynes, who maintains that all probabilities are conditional: “No proposition is in itself either probable or improbable, just as no place can be intrinsically distant; and the probability of the same statement varies with the evidence presented, which is, as it were, its origin of reference” (see Keynes, 1909, chapter 1).

The problem of absolutely prior probabilities is therefore moot, and it becomes clear that ‘objectivism’ is not really what I am advocating. Jaynes, who is initially interested in objectivism about absolutely prior probabilities as well, seems to have come around to this position when in his last work Probability Theory: The Logic of Science he formally introduces probabilities as conditional probabilities (and later asserts that “one man’s prior probability is another man’s posterior probability,” see Jaynes and Bretthorst, 2003, 89).
Prior probabilities in this dissertation are forever anterior, never original, as Roland Barthes puts it poetically in “The Death of the Author.” Alan Hájek also considers conditional probability the primitive notion and unconditional probability the derivative notion, citing a list of supporters (see Hájek, 2003, 315).

The claim that all information-sharing minds ought to distribute their belief in the same way could be called Donkin’s objectivism after the 19th century mathematician William Fishburn Donkin (see Zabell, 2005, 23). The logic of induction in this dissertation resists Donkin’s objectivism and only provides rules for how to proceed from one probability distribution to another, given certain evidence. It is a logic without initial point (compare it perhaps to coordinate-free synthetic geometry with its affine spaces). Just as in deductive logic, we may come to a tentative and voluntary agreement on a set of rules and presuppositions and then go part of the way together. (For a more systematic analogy between deductive inference and probabilistic inference see Walley, 1991, 485; see also Kaplan, 2010, 43, for an interesting take on this; as well as Huttegger, 2015, for a more systematic reconciliation project between objectivism and resistance to Donkin’s austere version of objectivism.) L.J. Savage insists in his Foundations of Statistics that “the criteria incorporated in the personalistic view do not guarantee agreement on all questions among all honest and freely communicating people, even in principle” (Savage, 1954, 67). My defence of pme is not inconsistent with this principle.

Bayesian probability kinematics rests on the idea that there are not only static norms about the probabilities of a rational agent, but also dynamic norms. The rational agent is not only constrained by the probability axioms, but also by standard conditioning as she adapts her probabilities to incoming evidence.
Paul Teller presents a diachronic Dutch-book argument for standard conditioning (see Teller, 1973; Teller, 1976), akin to de Finetti’s more widely accepted synchronic Dutch-book argument (for detractors of the synchronic Dutch-book argument see Seidenfeld et al., 1990; Foley, 1993, §4.4; and more recently Rowbottom, 2007; for a defence see Skyrms, 1987a). Brad Armendt expands Teller’s argument for Jeffrey conditioning (see Armendt, 1980). In contrast to the synchronic argument, however, there is considerable opposition to the diachronic Dutch-book argument (see Hacking, 1967; Levi, 1987; and Maher, 1992). Colin Howson and Peter Urbach make the argument that standard conditioning as a diachronic norm of updating is inconsistent (see Howson and Urbach, 2006, 81f).

An alternate route to justifying subjective degrees of belief and their probabilistic nature is to use Ramsey’s approach of providing a representation theorem. Representation theorems make rationality assumptions for preferences such as transitivity (the standard reference work is still Sen, 1971) and derive from them a probability and a utility function which are unique up to acceptable transformations. Ramsey only furnishes a sketch of how this can be done. The first fully formed representation theorem is by Leonard Savage (see Savage, 1954); but soon Jeffrey notes that its assumptions are too strong. Based on mathematical work by Ethan Bolker (see Bolker, 1966; and a summary for philosophers in Bolker, 1967), Jeffrey provides a representation theorem with weaker assumptions (in Jeffrey, 1978). Since then, representation theorems have proliferated (there is, for example, a representation theorem for an acceptance-based belief function in Maher, 1993, and one for decision theory in Joyce, 1999).
They are formally more complex than Dutch-book arguments, but well worth the effort because they make less controversial assumptions.

2.2 Information Theory and the Principle of Maximum Entropy

Before I can articulate my version of pme, I need to clarify what I mean by synchronic and diachronic constraints, especially in light of my comments about ignorance priors. I have been adamant not to touch absolutely prior probabilities in this dissertation and to limit myself to normativity about updating when relatively prior probabilities are already in place. This commitment is not arbitrary: I consider probabilities to be essentially conditional, not unconditional. They do not appear ex nihilo. I have no good answer at this point as to where these conditional probabilities come from; that is a valid question, but not the one I am addressing here. Some sort of subjectivism will have to do. It is important to me, however, and for similar reasons, that rational agents have subjective probabilities for the events they consider. The idea of indeterminate or imprecise probabilities is as suspicious to me as the idea of ignorance priors. Both strike me as inappropriately metaphysical. In the case of indeterminate probabilities, it is as if our uncertain doxastic states successfully referred to some sort of partial truth in nature, such as frequencies; in the case of ignorance priors, it is as if nature were obligated to have each experiment correspond uniquely to a privileged set-up, for example in Bertrand’s paradox, so that the epistemic agent can then apply the principle of indifference. Chapter 6 will delineate more formal reasons for the suspicion about indeterminacy.

Yet, when I articulate pme, I make reference to synchronic constraints. The diachronic constraints are not a problem: they result for example in the conditioning that is characteristic of probability update, where relatively prior probabilities are assumed.
Synchronic constraints, by contrast, presuppose that a set of events currently has no subjective probabilities assigned. This appears to conflict with my Laplacean realism, which stays away from ignorance priors and Donkin’s objectivism, but is committed to sharp probabilities for all events under consideration. It is the disclaimer ‘under consideration’ that will give me the leverage for synchronic constraints.

As throughout this dissertation, there is a finite event space Ω with |Ω| = n. An agent’s subjective probability distribution P is defined on a subset S of 𝒫(Ω), the powerset of Ω. What I will allow, while retaining the out-of-bounds rule for ignorance priors and indeterminate probabilities, is domain expansion for these probability distributions. The simplest example is negation: if S is not an algebra, sometimes the probability distribution can be expanded by setting P(−X) = 1 − P(X). −X is the complement of X (for more details on the model see section 3.1). In most cases it makes sense to require that S be an algebra, i.e. that it is closed with respect to complements and intersections and contains ∅.

The example of a non-algebra, however, illustrates that sometimes we may want to expand the domain of a subjective probability distribution to events that come under consideration, for example the negation of a proposition (the negation of a proposition corresponds to the complement of an event in an isomorphism between propositions and events). This must not be done by introducing ignorance priors, but it can be done by having synchronic constraints for logically related propositions. It is a synchronic constraint, for example, that the probability assigned to the negation of a proposition be such that P is still a probability distribution, therefore P(−X) = 1 − P(X).

Here is another example that I will need in chapter 5. While events X and Y may be under consideration, their intersection may not be.
As with negation, this will only happen if the original S is not an algebra. In contrast to negation, however, many different joint probability distributions are compatible with the given marginal probability distribution. The synchronic constraint to fill in the joint probabilities using the probability axioms does not yield a unique distribution. This is where pme makes a controversial recommendation: fill in the joint probabilities as if X and Y were independent, unless you have information to the contrary. I will show how this recommendation derives from pme, explain its relation to independence, and defend it at various points in my dissertation.

My version of pme goes as follows:

If a rational agent is subject to constraints describing, but not uniquely identifying, her doxastic state in terms of coherent probability distributions, then the doxastic state of the agent is characterized by the unique probability distribution identified by information theory, if it exists. If the constraints are synchronic, the principle of maximum entropy is used. If they are diachronic, the principle of minimum cross-entropy is used.

Here are four examples to illuminate pme. For simplicity, I am assuming that only one of the following three candidates will be the eventual Republican nominee for the presidential election in the US in 2016: Donald Trump, Ted Cruz, and Marco Rubio. In the following examples, a computer program called Rasputin which has no information but the information provided in the example is learning how to provide rational betting odds for forced bets. I am using betting odds from a computer instead of the doxastic state of a rational agent merely for illustration. Setting a betting threshold at p means that, if forced to either accept a bet on proposition X at a cost of p̂ or on ¬X at a cost of 1 − p̂, Rasputin will choose the former bet if p > p̂ and the latter bet if p ≤ p̂.
Bets are always placed such that the prize is $1 if the proposition turns out to be true and nothing if it turns out to be false.

Example 3: Trump Standard Conditioning. A prediction market like www.pivit.io takes the following bets: 66 cents on the dollar for Donald Trump to be the Republican nominee, 21 cents on the dollar for Ted Cruz, and 13 cents on the dollar for Marco Rubio. Cruz announces (reliably) that he will drop out of the race if he loses Texas. Rasputin sets the betting threshold for Trump at 83.544 and for Rubio at 16.456 cents on the dollar in case Cruz loses Texas, that is, at 66/79 and 13/79 of a dollar respectively (this is a conditional bet which is called off with the cost of the bet reimbursed if the antecedent does not hold true).

Example 4: Trump Jeffrey Conditioning. Given the odds in example 3, the prediction market also takes 97 cents on the dollar bets that if Rubio loses Florida, he will not be the Republican nominee (another conditional bet). Rasputin sets the betting threshold for Trump at 73.586 and for Cruz at 23.414 cents on the dollar in case Rubio loses Florida, that is, at 0.97 · 66/87 and 0.97 · 21/87 of a dollar respectively.

Example 5: Trump Affine Constraint. Due to a transmission problem, Rasputin receives an incomplete update of current betting ratios after the primary in Michigan. Given the odds in example 3, Rasputin can only add the information that if either Trump or Cruz wins the nomination (another conditional bet), the bets on Trump are 85 cents on the dollar. Rasputin now has the option of keeping Rubio’s odds constant at 13 cents or changing them in accordance with the principle of minimum cross-entropy (this is the Judy Benjamin problem). Rasputin applies the principle of minimum cross-entropy and sets the betting threshold for Rubio at 13.289 cents on the dollar. I will explain in subsection 3.2.3 how to calculate this number, using the constraint rule for cross-entropy.

Example 6: Trump Marginal Probabilities.
Rasputin has the betting odds for Trump, Cruz, and Rubio on the Republican side (66:21:13) and the betting odds for Clinton and Sanders on the Democratic side (82:18). Rasputin needs to set betting thresholds for forced bets on the joint events, such as ‘Cruz will win the Republican nomination and Clinton will win the Democratic nomination’. Given the marginal probabilities, many combinations of joint probabilities are probabilistically coherent. Rasputin chooses a unique set of joint probabilities by multiplying the marginal probabilities as in the following table because they have the highest entropy of those that are probabilistically coherent.

∧          Trump    Cruz     Rubio
Clinton    0.5412   0.1722   0.1066   0.82
Sanders    0.1188   0.0378   0.0234   0.18
           0.66     0.21     0.13     1.00

As for historical development, when Jaynes introduces his version of pme (see Jaynes, 1957a; 1957b), he is less indebted to the philosophy of science project of giving an account of semantic information (as in Carnap and Bar-Hillel, 1952; 1953) than to Claude Shannon’s mathematical theory of information and communication. Shannon identifies information entropy with a numerical measure of a probability distribution fulfilling certain requirements (for example, that the measure is additive over independent sources of uncertainty). The focus is not on what information is but how we can formalize an axiomatized measure. Entropy stands for the uncertainty that is still contained in information (certainty is characterized by zero entropy). Shannon introduces information entropy in 1948 (see Shannon, 2001), based on work done by Norbert Wiener connecting probability theory to information theory (see Wiener, 1939).
Jaynes also traces his work back to Ludwig Boltzmann and Josiah Gibbs, who build the mathematical foundation of information entropy by investigating entropy in statistical mechanics (see Boltzmann, 1877; Gibbs, 1902).

For the further development of pme in probability kinematics it is important to refer to the work of Richard Jeffrey, who establishes the discipline (see Jeffrey, 1965), and Solomon Kullback, who provides the mathematical foundations of minimum cross-entropy (see Kullback, 1959). In probability kinematics, contrasted with standard conditioning, evidence is uncertain (for example, the ball drawn from an urn may have been observed only briefly and under poor lighting conditions).

Jeffrey addresses many of the conceptual problems attending probability kinematics by providing a much improved representation theorem, thereby creating a tight connection between preference theory and its relatively plausible axiomatic foundation and a probabilistic view of ‘beliefs.’ Jeffrey and Isaac Levi’s (see Levi, 1967) debates on partial belief and acceptance (revisable and not necessarily certain full belief), which Jeffrey considered to be as opposed to each other as Dracula and Wolfman (see Jeffrey, 1970), set the stage for two ‘epistemological dimensions’ (Henry Kyburg’s term, see Kyburg, 1995, 343), which will occupy us in detail and towards which I will take a more conciliatory approach, as far as their opposition or mutual exclusion is concerned.

Kullback’s divergence relationship between probability distributions makes possible a smooth transition from synchronic arguments about absolutely prior probabilities to diachronic argument about probability kinematics (this transition is much more troublesome from the synchronic Dutch-book argument to the diachronic Dutch-book argument; for the information-theoretic virtues of the Kullback-Leibler divergence see Kullback and Leibler, 1951; Seidenfeld, 1986, 262ff; Guiaşu,
1977, 308ff).

Jaynes’ project of probability as a logic of science is originally conceived to provide objective absolutely prior probabilities by using pme, rather than to provide objective posterior probabilities, given relatively prior probabilities. It is, however, easy to turn pme into a method of probability kinematics using the Kullback-Leibler divergence. Jaynes presents this method in 1978 at an MIT conference under the title “Where Do We Stand on Maximum Entropy?” (see Jaynes, 1978), where he explains the Brandeis problem and demonstrates the use of Lagrange multipliers in probability kinematics.

Ariel Caticha and Adom Giffin have recently demonstrated, using Lagrange multipliers, that pme seamlessly generalizes standard conditioning (see Caticha and Giffin, 2006). Many others, however, think that in one way or another pme is inconsistent with standard conditioning, to the detriment of pme (see Seidenfeld, 1979, 432f; Shimony, 1985; van Fraassen, 1993, 288ff; Uffink, 1995, 14; and Howson and Urbach, 2006, 278); Jon Williamson believes so, too, but to the detriment of standard conditioning (see Williamson, 2011).

Arnold Zellner proves that standard conditioning as a diachronic updating rule (Bayes’ theorem) is the “optimal information processing rule” (Zellner, 1988, 278), also using Lagrange multipliers (Jaynes already has this hunch in Jaynes, 1988, 280; see also van Fraassen et al., 1986, 376). Standard conditioning is neither inefficient (using a suitable information metric), diminishing the output information compared to the input information, nor does it add extraneous information. This is just the simple conceptual idea behind pme, although pme only requires optimality, not full efficiency. Full efficiency implies optimality, therefore standard conditioning fulfills pme.

Once pme is formally well-defined and its scope established (for the latter, Imre Csiszár’s work on affine constraints is important, see Csiszár, 1967), its virtues come to the foreground.
While Richard Cox (see Cox, 1946) and E.T. Jaynes defend the idea of probability as a formal system of logic, John Shore and Rodney Johnson provide the necessary detail to establish the uniqueness of pme in meeting intuitively compelling axioms (see Shore and Johnson, 1980).

2.3 Criticism

By then, however, an avalanche of criticism against pme as an objective updating method had been launched. Papers by Abner Shimony (see Friedman and Shimony, 1971; Dias and Shimony, 1981; Shimony, 1993) note the following problem.

Example 7: Shimony’s Lagrange Multiplier. You mail-order a four-sided die whose spots are 1, 2, 3, and 6 respectively (I am using this example leaning on Jaynes’ Brandeis Dice problem). The die is mailed to a lab, where a graduate student takes it out of its package and rolls it thousands of times. After writing down the long-run average µ, she rolls it one more time and reports the result of the last die roll to you. X_k is the proposition that the die roll comes up k spots on the last die roll. Before you hear the result, you have no idea what the die’s bias is. Your relative priors happen to be equiprobabilities, so P(X_k) = 1/4 for k = 1, 2, 3, 6.

Let Y_z be the event indicating that the long-run average as recorded by the graduate student is z with z ∈ [1, 6] ⊂ ℝ. h is the probability density for Y_z,

\[ P(a \le z \le b) = \int_a^b h(z)\,dz, \tag{2.1} \]

and g_k(z) is the probability density for P(X_k|Y_z),

\[ P\Bigl(X_k \Bigm| \bigcup_{z \in [a,b]} Y_z\Bigr) = \int_a^b g_k(z)\,dz. \tag{2.2} \]

The law of total probability gives us the following (integrals without subscript are integrals over ℝ):

\[ \frac{1}{4} = P(X_k) = \int g_k(z)\,h(z)\,dz. \tag{2.3} \]

Let ν be the measure associated with the probability density h,

\[ \nu(A) = P\Bigl(\bigcup_{z \in A} Y_z\Bigr), \tag{2.4} \]

where A is a measurable subset of ℝ. In my example, ν(A) is trivially zero for A ∩ (1, 6) = ∅. Then I may write for any k = 1, 2, 3, 6,

\[ P(X_k) = \int \frac{e^{-\beta k}}{\sum_{i=1,2,3,6} e^{-\beta i}}\,d\mu \tag{2.5} \]

according to the constraint rule of cross-entropy, which will be introduced in subsection 3.2.3.
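To make the role of β in (2.5) more concrete, here is a minimal Python sketch; it is not part of the original argument. It assumes the exponential form of the integrand in (2.5) for the faces 1, 2, 3, 6 and illustrates that each long-run average z strictly between 1 and 6 corresponds to exactly one value of β. The function names and the bisection bounds are illustrative assumptions.

```python
import math

FACES = (1, 2, 3, 6)  # spots on the four-sided die of example 7


def distribution(beta):
    """Probabilities e^(-beta*k) / sum_i e^(-beta*i), as in the integrand of (2.5)."""
    weights = [math.exp(-beta * k) for k in FACES]
    total = sum(weights)
    return [w / total for w in weights]


def mean_for_beta(beta):
    """Expected number of spots under distribution(beta)."""
    return sum(k * p for k, p in zip(FACES, distribution(beta)))


def beta_for_mean(z, lo=-50.0, hi=50.0, steps=200):
    """Invert mean_for_beta by bisection (the mean is strictly decreasing in beta).

    The bounds lo and hi are an assumption of this sketch; they cover any z
    that is not extremely close to 1 or 6.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if mean_for_beta(mid) > z:
            lo = mid  # mean still too large, so beta must grow
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    for z in (2.0, 3.0, 4.0):
        beta = beta_for_mean(z)
        print(z, round(beta, 6), [round(p, 4) for p in distribution(beta)])
```

Under these assumptions, z = 3 corresponds to β = 0 and the equiprobable distribution, averages above 3 correspond to negative β, and averages below 3 to positive β.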
There is an isomorphism ϕ between the long-run average z and the Lagrange multiplier β which provides the basis for the definition of the measure

\[ \mu(B) = \nu(\varphi^{-1}(B)) \quad \text{for any measurable } B \subseteq \mathbb{R}. \tag{2.6} \]

Simplifying (2.5) to

\[ P(X_k) = \int \Bigl(\sum_{i=1,2,3,6} e^{\beta(k-i)}\Bigr)^{-1} d\mu = \int I(\beta)\,d\mu, \tag{2.7} \]

the derivative of the integrand I(β) is

\[ \frac{dI(\beta)}{d\beta} = -\Bigl(\sum_{i=1,2,3,6} e^{\beta(k-i)}\Bigr)^{-2} \sum_{i=1,2,3,6} (k-i)\,e^{\beta(k-i)}, \tag{2.8} \]

which vanishes at β = 0 for k = 3. Let us assume k = 3 from now on. Shimony is making sure that k is one of the possible outcomes and at the same time their mean. In example 7, I fulfill this requirement by choosing 1, 2, 3, 6, where 3 is the mean. A regular die, whose fair mean is 3.5, would not fulfill it because 3.5 is not one of the outcomes.

Straightforward calculations show that β = 0 is the only extremum of I(β) and that it is a maximum. To summarize the properties of I, (i) it is extremal at zero; (ii) I(0) = 1/4; and (iii) ∫ I(β) dµ = 1/4. These three properties imply that µ({0}) = 1. To see this, consider the following sequence of subsets of ℝ:

\[ L_n = \Bigl\{\beta : I(\beta) < \frac{1}{4} - \frac{1}{n}\Bigr\}. \tag{2.9} \]

Then,

\[ \frac{1}{4} = \int I(\beta)\,d\mu = \int_{L_n} I(\beta)\,d\mu + \int_{\mathbb{R}\setminus L_n} I(\beta)\,d\mu \le \Bigl(\frac{1}{4} - \frac{1}{n}\Bigr)\mu(L_n) + \frac{1}{4}\bigl(1 - \mu(L_n)\bigr) = \frac{1}{4} - \frac{\mu(L_n)}{n}, \tag{2.10} \]

so µ(L_n) = 0. Now by countable additivity,

\[ \mu(\mathbb{R}\setminus\{0\}) = \mu\Bigl(\bigcup_{n \in \mathbb{N}} L_n\Bigr) = 0. \tag{2.11} \]

Before you know anything else, you ought to know with certainty that the long-run average recorded by the graduate student is 3, which is the value of z corresponding to β = 0. This conflicts with the assumption that you know nothing about the die when it arrives in the mail.

Shimony’s more general description of example 7 convinces Brian Skyrms, among many others, that pme and its objectivism are not tenable (see Skyrms, 1985; 1986; and 1987b). In section 2.5, I provide literature to defend pme against the claims of Shimony’s Lagrange multiplier problem. In a nutshell, I can say that it is the isomorphism ϕ in conjunction with (2.2) that draws Jaynes’ withering criticism (see Jaynes, 1985, 134ff).
While z is a legitimate statistical parameter of the distribution for X_k, β is an artefact of the mathematical process to find a posterior probability, not a conditional probability. (2.3), however, is only true for conditional probabilities. The mathematical artefact β is defined by the procedure and therefore not uncertain as the parameter z would be, even if there is a one-one correspondence between their values. z is within the domain of the prior probability distribution and its conditional probabilities; β is not.

Following Shimony’s Lagrange multiplier problem, Bas van Fraassen’s Judy Benjamin problem (see van Fraassen, 1981) deals another blow to pme in the literature, motivating Joseph Halpern (who already has reservations against Cox’s theorem, see Halpern, 1999) to reject it in his textbook on uncertainty (see Halpern, 2003). I will cover the Judy Benjamin problem at length in chapter 7.

Teddy Seidenfeld runs a campaign against objective updating methods in articles such as “Why I Am Not an Objective Bayesian” (see Seidenfeld, 1979; 1986). Jos Uffink takes issue with Shore and Johnson, casting doubt on the uniqueness claims of pme (see Uffink, 1995; 1996). Carl Wagner introduces a counterexample to pme (see Wagner, 1992), again, as in the Judy Benjamin counterexample (but in much greater generality), involving conditioning on conditionals.

In 2003, Halpern criticizes pme with the help of Peter Grünwald and the concept of ‘coarsening at random,’ which according to the authors demonstrates that pme “essentially never gives the right results” (see Grünwald and Halpern, 2003, 243). A CAR condition specifies the requirements necessary to keep conditioning on a naive space consistent with conditioning on a sophisticated space. De Finetti’s exchangeability is an example of a CAR condition: as long as exchangeability holds, conditioning on a simpler event space yields results that are consistent with conditioning on a more complicated event space.
Grünwald and Halpern seek to show that an application of pme to a naive space is almost always inconsistent with the proper application of standard conditioning to the corresponding sophisticated space. Especially in the Judy Benjamin case, the fact that there is no non-trivial CAR condition for pme undermines the result that pme provides. I address this question in more detail in section 7.4.

In 2009, Douven and Romeijn write an article on the Judy Benjamin problem (see Douven and Romeijn, 2009) in which they ask probing questions about the compatibility of objective updating methods with epistemic entrenchment. I will address these questions in section 7.3.

Malcolm Forster and Elliott Sober’s attack on Bayesian epistemology using Akaike’s Information Criterion is articulated in the 1990s (see Forster and Sober, 1994) but reverberates well into the next decade (Howson and Urbach call the authors the ‘scourges of Bayesianism’). Because the attack concerns Bayesian methodology as a whole, it is not within my purview to defend pme against it (for a defence of Bayesianism see Howson and Urbach, 2006, 292ff), but it deserves mention for its direct reference to information as a criterion for inference and provides an interesting point of comparison for maximum entropy.

Another criticism which affects both the weaker Bayesian claim for standard conditioning and the stronger pme is its purported excessive apriorism, i.e. the concern that the agent can never really move away from beliefs once formed, and that those beliefs always need to be fully formed all the time. It can be found as early as 1945 in Carl Hempel (see Hempel and Oppenheim, 1945, 107) and is raised again as late as 2005 by James Joyce (see Joyce, 2005, 170f). Peter Walley uses a similar point to criticize the Bayesian position (see Walley, 1991, 334).
In Bayes or Bust?, excessive apriorism (among other things) leads John Earman to his tongue-in-cheek position of being a Bayesian only on Mondays, Wednesdays, and Fridays (see Earman, 1992, 1; for the detailed criticism Earman, 1992, 139f).

Gillies also has these reservations (see Gillies, 2000, 81; 84) and cites de Finetti in support,

If you want to apply mathematics, you must act as though the measured magnitudes have precise values. This fiction is very fruitful, as everybody knows; the fact that it is only a fiction does not diminish its value as long as we bear in mind that the precision of the results will be what it will be . . . to go, with the valid help of mathematics, from approximate premises to approximate conclusions, I must go by way of an exact algorithm, even though I consider it an artifice. (De Finetti, 1931, 303.)

Seidenfeld militates against objective Bayesianism in 1979 using excessive apriorism (I owe the term to him, see Seidenfeld, 1979, 414). Again, because the charge is directed at Bayesians more generally, I do not need to address it, but mention it because pme may have resources at its disposal that the more general Bayesian position lacks (for this position, see Williamson, 2011).

A more recent criticism centres on the concept of epistemic entrenchment, which I will introduce in the next section. While epistemic entrenchment has attracted attention, not least on account of some interesting formal features and the ease with which it treats updating on conditionals, information theory has difficulty accommodating it.

2.4 Acceptance versus Probabilistic Belief

Epistemic entrenchment figures prominently in the AGM literature on belief revision (for one of its founding documents see Alchourrón et al., 1985) and is based on two levels of uncertainty about a proposition: its static inclusion in belief sets on the one hand, and its dynamic behaviour under belief revision on the other hand.
It is one thing, for example, to think that the probability of a coin landing heads is 1/2 and consider it fair because you have observed one hundred tosses of it; it is another to think that the probability of a coin landing heads is 1/2 because you know nothing about it. In the former scenario, your belief that P(X = H) = 0.5 is more entrenched.

Wolfgang Spohn provides an excellent overview of the interplay between Bayesian probability theory, AGM belief revision, and ranking functions (see Spohn, 2012). The extent to which pme is compatible with epistemic entrenchment and a distinction between the static and the dynamic level will be a major topic of my investigation. At first glance, pme and epistemic entrenchment are at odds, because pme operates without recourse to a second epistemic layer behind probabilities expressing uncertainty. My conclusion is that the content of this layer is expressible in terms of evidence and is not epistemic.

For a long time, there has been unease between defenders of partial belief (such as Richard Jeffrey) and defenders of full (and usually defeasible or fallible) belief (such as Isaac Levi). This issue is viewed more pragmatically beginning in the 1990s with Patrick Maher’s Betting on Theories (see Maher, 1993) and Wolfgang Spohn’s work in several articles (later summarized in Spohn, 2012). Both authors seek to downplay the contradictory nature of these two approaches and emphasize how both are necessary and able to inform each other.

Maher argues that representation theorems are superior to Dutch-book arguments in justifying Bayesian methodology, but then distinguishes between practical utility and cognitive utility. Whereas probabilism is appropriate in the arena of acting, based on practical utility, acceptance is appropriate in the arena of asserting, based on cognitive utility. Maher then provides his own representation theorem with respect to cognitive utility, underlining the resemblance in structure between the probabilistic
and the acceptance-based approach.

In a similar vein, Spohn demonstrates the structural similarities between the two approaches using ranking theory for the acceptance-based approach. Together with the formal methods of the AGM paradigm, ranking theory delivers results that are analogous to the already well-formulated results of Bayesian epistemology. Maher and Spohn put us on the right track of reconciliation between the two epistemological dimensions, and I hope to contribute to it by showing that pme can be coherently coordinated with this reconciliation. This will only be possible if I clarify the relation that pme has to epistemic entrenchment and how it conditions on conditionals. For more detail on this see especially section 7.3.

2.5 Proposal

My interpretation of pme is an intermediate position between what I would call Jaynes’ Laplacean idealism, where evidence logically prescribes unique and determinate probability distributions to be held by rational agents; and a softened version of Bayesianism exemplified by, for example, Richard Jeffrey and James Joyce (for the latter see Joyce, 2005).

I side with Jaynes in so far as I am committed to determinate prior probabilities, whether they are absolute or relative (why this is important will become clear in chapter 5). Once a rational agent considers a well-defined event space, the agent is able to assign fixed numerical probabilities to it (this ability is logical, not practical; in practice, the assignment may not be computationally feasible). This assignment is either irreducibly relative to the updating process (where conditional probabilities are conceptually foundational to unconditional probabilities, not the other way around), or it depends on logical relationships to existing probabilities which provide the necessary synchronic constraints.
Because there is no objectivity in the genealogy of conditional probabilities and in the interpretation of evidence, both of which introduce elements of subjectivity, I part ways with Jaynes on objectivity.

‘Humanly faced’ Bayesians such as Jeffrey or Joyce claim that rational agents typically lack determinate subjective probabilities and that their opinions are characterized by imprecise credal states in response to unspecific and equivocal evidence. There is a difference, however, between (1) appreciating the imprecision in interpreting observations and, in the context of probability updating, casting them into appropriate mathematical constraints for updated probability distributions, and (2) bringing to bear formal methods to probability updating which require numerically precise priors. The same is true for using calculus with imprecise measurements. The inevitable imprecision in our measurements does not attenuate the logic of using real analysis to come to conclusions about the volume of a barrel or the area of an elliptical flower bed. My project will seek to articulate these distinctions more explicitly.

My project therefore promotes what I would call Laplacean realism, contrasting it with Jaynes’ Laplacean idealism (Sandy Zabell uses the less complimentary term “right-wing totalitarianism” for Jaynes’ position, see Zabell, 2005, 28), but also distinguishing it from contemporary softened versions of Bayesianism such as Joyce’s or Jeffrey’s (Zabell’s corresponding term is “left-wing dadaists,” although he does not apply it to Bayesians). What is distinctive about my approach to Bayesianism is the high value I assign to the role that information theory plays within it. My contention is that information theory provides a logic for belief revision.
Almost all epistemologists who are Bayesians currently have severe doubts that information theory can deliver on this promise, not to mention their doubts about the logical nature of belief revision (see for example Zabell, 2005, 25, where he repeatedly charges that advocates of logical probability have never successfully addressed Ramsey’s “simple criticism” about how to apply observations to the logical relations of probabilities).

One way in which these doubts can be addressed is by referring them to the more general debate about the relationship between mathematics and the world. The relationship between probabilities and the events to which they are assigned is not unlike the relationship between the real numbers I assign to the things I measure and calculate and their properties in the physical world. As unsatisfying as our understanding of the relationship between formal apparatus and physical reality may be, the power, elegance, and internal consistency of the formal methods is rarely in dispute. Information theory is one such apparatus, probability theory is another. In contemporary epistemology, their relationship is held to be at best informative of each other. Whenever there are conceptual problems or counterintuitive examples, the two come apart. I consider the relationship to be more substantial than currently assumed.

There have been promising and mathematically sophisticated attempts to define probability theory in terms of information theory (see for example Ingarden and Urbanik, 1962; Kolmogorov, 1968; Kampé de Fériet and Forte, 1967; for a detractor who calls information theory a “chapter of the general theory of probability” see Khinchin, 1957). While interesting, making one of information theory or probability theory derivative of the other is not my project.
What is at the core of my project is the idea that information theory delivers the unique and across-the-board successful candidate for an objective updating mechanism in probability kinematics. This idea is unpopular in the literature, but the arguments on which the literature relies are not robust, either in quantity or in quality. There are only a handful of counterexamples, none of which are ultimately resistant to explanation by a carefully articulated version of my claim.

Carnap advises pragmatic flexibility with respect to inductive methods, although he presents only a one-dimensional parameter system of inductive methods which curbs the flexibility. On the one hand, Dias and Shimony report that Carnap’s λ-continuum of inductive methods is consistent with pme only if λ = ∞ (see Dias and Shimony, 1981). This choice of inductive method is unacceptable even to Carnap, albeit allowed by the parameter system, because it gives no weight to experience (see Carnap, 1952, 37ff). On the other hand, Jaynes makes the case that pme entails Laplace’s Rule of Succession (λ = 2) and thus occupies a comfortable middle position between giving all weight to experience (λ = 0, for the problems of this position see Carnap, 1952, 40ff) or none at all (λ = ∞). While Carnap’s parameter system of inductive methods rests on problematic assumptions, I surmise that Dias and Shimony’s assignment of λ, given pme, is erroneous, and that Jaynes’ assignment is better justified (for literature see the next paragraph). Since this is an old debate, I will not resurrect it here.

In order to defend my position, I need to address three important counterexamples which at first glance discredit pme: Shimony’s Lagrange multiplier problem, van Fraassen’s Judy Benjamin case, and Wagner’s Linguist. For the latter two, I am confident that I can make a persuasive case for pme, based on formal features of these problems which favour pme on closer examination.
For the former (Shimony), Jaynes has written a spirited rebuttal (see Jaynes, 1985, 134ff). According to Jaynes, errors in Shimony’s argument have been pointed out five times (see Hobson, 1972; Tribus and Motroni, 1972; Gage and Hestenes, 1973; Jaynes, 1978; Cyranski, 1979). This does not, however, keep Brian Skyrms, Jos Uffink, and Teddy Seidenfeld from referring again to Shimony’s argument in rejecting pme in the 1980s.

As mentioned before, Jeffrey with his radical probabilism pursues a project of epistemological monism (see Jeffrey, 1965) which would reduce beliefs to probabilities, while Spohn and Maher seek reconciliation between the two dimensions, showing how fallible full beliefs are epistemologically necessary and how the formal structure of the two dimensions reveals many shared features so that in the end they have more in common than what separates them (see Spohn, 2012, 201 and Maher, 1993).

In the end, my project is not about the semantics of doxastic states. I do not argue for eliminating beliefs in favour of probabilities; on the contrary, the belief revision literature has opened an important door for inquiry in the Bayesian dimension with its concept of epistemic entrenchment. This is a good example of the kind of cross-fertilization between the two different dimensions that Spohn has in mind, mostly in terms of formal analogies and with little worry about semantics, important as they may be. Maher provides similar parallels between the two dimensions, also with an emphasis on formal relationships, in terms of representation theorems. Pioneering papers in probabilistic versions of epistemic entrenchment are recent (see Bradley, 2005; Douven and Romeijn, 2009).

The guiding idea behind epistemic entrenchment is that once an agent is apprised of a conditional (indicative or material), she has a choice of adjusting her credence in either the antecedent or the consequent (or both).
Often, the credence in the antecedent remains constant and only the credence in the consequent is adjusted (Bradley calls this ‘Adams conditioning’). Douven and Romeijn give an example where the opposite is plausible and the credence in the consequent is left constant (see Douven and Romeijn, 2009, 12). Douven and Romeijn speculate that an agent can theoretically take any position in between, and they use Hellinger’s distance to represent these intermediary positions formally (see Douven and Romeijn, 2009, 14).

Bradley, Douven, and Romeijn use a notion frequently used and introduced by the AGM literature to capture formally analogous structures in probability theory. The question is how compatible the use of epistemic entrenchment in probabilistic belief revision (probability kinematics) is with pme. pme appears to assign probability distributions to events without any heed to epistemic entrenchment. The Judy Benjamin problem is a case in point. pme’s posterior probabilities are somewhere in between the possible epistemic entrenchments, as though mediating between them, but they affix themselves to a determinate position (which in some quarters raises worries analogous to excessive apriorism).

Epistemologists are generally agreed that pme is not needed because they expect pragmatic latitude in addressing questions of belief revision. The full employment theorem prefers a wide array of updating methods to a reduction and narrowing of choices. As there is already widespread consensus that objectivism cannot provide a unique set of absolutely prior probabilities, nobody sees any reason that objectivism should succeed in probability kinematics.
The problem with this position is that the wide array of updating methods systematically leads to solutions which contradict the principle that a rational agent uses relevant information and does not gain unwarranted information, and that for most constraints this principle leads to a unique solution optimally fulfilling it.

In the quest to undermine uniqueness, independence assumptions often sneak in through the back door where they are indefensible (see section 7.3). Well-posed problems are dismissed as ambiguous (e.g. the Judy Benjamin problem), while problems that are overdetermined may be treated as open to a whole host of solutions because they are deemed underdetermined (e.g. von Mises’ water and wine paradox). Ad hoc updating methods proliferate which can often be subsumed under pme with little mathematical effort (e.g. Wagner’s Linguist problem and its treatment in chapter 5).

Chapter 3

Changing Partial Beliefs Using Information Theory

3.1 The Model

This chapter describes in more detail how probabilities are updated using information theory. First I distinguish between full beliefs and partial beliefs. A full belief formal epistemology, such as in Spohn, 2012, provides an account of what it means for an agent to accept a proposition X, how this acceptance is quantitatively related to other beliefs, as for example in ranking theory, and how it changes with new evidence. Partial belief formal epistemology provides this account using a function that assigns to some propositions a non-negative real number representing partial belief.

In this dissertation, I am limiting myself to finite outcome spaces and finite propositional languages. Let Ω be a finite outcome space with |Ω| = n. Since Ω is finite, there is no issue with ill-behaved power sets and no need to use a σ-algebra instead, so let 𝒮 be the power set of Ω.
A partial belief function $B$ has as its domain a subset $S \subseteq \mathcal{S}$ of the power set $\mathcal{S}$ of $\Omega$ and as its range the non-negative real numbers. Assuming probabilism, $B$ has the following properties: (P1) $B(\Omega) = 1$; and (P2) if $X \cap Y = \emptyset$ then $B(X \cup Y) = B(X) + B(Y)$. Dempster-Shafer belief functions are an alternative to probabilistic belief functions (see Dempster, 1967). Since I will almost exclusively talk about probabilistic belief functions from now on, I will use the mathematical notation $P$ instead of $B$. Instead of sets, I will sometimes apply $P$ to propositions, assuming an isomorphism between the outcome space and a corresponding propositional language for which the proposition $\tilde{X} \vee \tilde{Y}$ corresponds to the set $X \cup Y$, the proposition $\tilde{X} \wedge \tilde{Y}$ corresponds to the set $X \cap Y$, and the proposition $\neg\tilde{X}$ corresponds to $-X$, the complement of $X$.

There are special probabilistic partial belief functions $P_Y$ called conditional probability functions which have the additional property that $P_Y(Y) = 1$. We usually write $P_Y(X) = P(X|Y)$. For Bayesians,

$$P(X|Y) = \frac{P(X \cap Y)}{P(Y)}. \tag{3.1}$$

Bayes’ formula is a direct consequence of (3.1), provided $P(X) \neq 0$ and $P(Y) \neq 0$,

$$P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)}. \tag{3.2}$$

I am not concerned here with absolutely prior beliefs, so for the rest of this dissertation the assumption is that there is a belief function $P$, wherever it may have come from, and that my interest is focused on how to update it to a belief function $P'$, given some intervening evidential input. For this evidential input, I restrict myself to the consideration of constraints in the form

$$\sum_{i=1}^{n} c_{ij} P'(X_i) = b_j \tag{3.3}$$

for an $n \times k$ coefficient matrix $C = (c_{ij})$ and a $k$-dimensional vector $b$. I will give a few examples first and then explain why constraints in this form are called affine constraints and why non-affine constraints would introduce problems which it is wise to bracket for the moment.

The best-known example for an affine constraint is standard conditioning.
In subsection 4.2.3, example 11 tells the story of a criminal case in which Sherlock Holmes’ probability distribution over three people is

$$P(E_1) = 1/3, \quad P(E_2) = 1/2, \quad P(E_3) = 1/6 \tag{3.4}$$

where $E_1$ is the proposition that the first person committed the crime. Let us say Holmes finds out that the third person did not commit the crime. The updating of $P(E_1)$ and $P(E_2)$ in light of $P'(E_3) = 0$ is called standard conditioning. Holmes’ evidence in this case is $k = 1$, $b_1 = 0$ and

$$C = (0, 0, 1)^\top. \tag{3.5}$$

Another well-known example for an affine constraint different from standard conditioning is a Jeffrey-type updating scenario. If Holmes’ first person is male and the other two persons are female, then Holmes may find out that the probability of the culprit being male is 1/2 after holding the relatively prior probabilities in (3.4). He needs to change $P(E_1) = 1/3$ to $P'(E_1) = 1/2$. In the language of (3.3), his evidence is $k = 1$, $b_1 = 0.5$ and

$$C = (1, 0, 0)^\top. \tag{3.6}$$

A third example of an affine constraint, different from both standard and Jeffrey-type updating scenarios, is illustrated by Jaynes’ Brandeis Dice problem introduced in his 1962 Brandeis lectures.

Example 8: Brandeis Dice. When a die is rolled, the number of spots up can have any value $i$ in $1 \le i \le 6$. Suppose a die has been rolled many times and we are told only that the average number of spots up was not 3.5 as we might expect from an ‘honest’ die but 4.5. Given this information, and nothing else, what probability should we assign to $i$ spots on the next roll? (See Jaynes, 1989, 243.)

Jaynes considers the constraint in this case to be the mean value constraint

$$\sum_{i=1}^{6} i\,P'(X_i) = 4.5 \tag{3.7}$$

where $X_i$ is the proposition that $i$ spots will be up on the next roll. In the language of (3.3), the evidence is $k = 1$, $b_1 = 4.5$ and

$$C = (1, 2, 3, 4, 5, 6)^\top. \tag{3.8}$$

The disjunctive normal form theorem in logic guarantees that any proposition in my finite propositional language is expressible as a disjunction of atomic propositions (each corresponding to an atomic outcome $\omega \in \Omega$). Consequently, any probabilistic belief function $P$ is determined by its values $P(\omega)$ on the atoms of the outcome space (see Paris, 2006, 13). This means that every probabilistic belief function is uniquely represented by a point on the $(n-1)$-dimensional simplex $\mathbb{S}^{n-1} \subset \mathbb{R}^n$ in the following way: if $P(\omega_i) = x_i$ for $i = 1, \ldots, n$ then this probabilistic belief function corresponds to $X = (x_1, \ldots, x_n)$ in $\mathbb{R}^n$ with

$$\sum_{i=1}^{n} x_i = 1. \tag{3.9}$$

(3.9) is the identifying characteristic for points on the simplex $\mathbb{S}^{n-1}$. Statisticians usually want to model these cases using parameters. For finite outcome spaces, the model has $n-1$ parameters, for example $x_2, \ldots, x_n$. $x_1$ is then determined to be

$$x_1 = 1 - \sum_{i=2}^{n} x_i. \tag{3.10}$$

The statistical model, an affine constraint, and an interesting ambivalence in updating methods are illustrated in figures 3.1 and 3.2.

Figure 3.1: The statistical model of the 2-dimensional simplex $\mathbb{S}^2$. $y$ and $z$ are the parameters, whereas $x$ is fully determined by $y$ and $z$. $x, y, z$ represent the probabilities $P(X_1), P(X_2), P(X_3)$ on an outcome space $\Omega = \{\omega_1, \omega_2, \omega_3\}$, where $X_i$ is the proposition corresponding to the atomic outcome $\omega_i$. The point $(y, z)$ corresponds to the numbers in example 11.

Figure 3.2: Illustration of an affine constraint. This is a Jeffrey-type updating scenario. While $x = 1/3$ as in figure 3.1, the updated $x' = 1/2$. These are again the numbers of example 11. There appear to be two plausible ways to update $(y, z)$. One way, which we will call Jeffrey conditioning, follows the line to the origin of the coordinate system. It accords with our intuition that as $x' \to 1$, the updated $(y'_1, z'_1)$ is continuous with standard conditioning. The other way, which in chapter 4 I will call LP conditioning, uses the shortest geometric distance from $(y, z)$ to determine $(y'_2, z'_2)$. This case is illustrated in three dimensions (not using statistical parameters) in figure 4.4. One of the main points of this dissertation is that the ambivalence between updating methods can be resolved by using cross-entropy rather than Euclidean distance. Jeffrey conditioning gives us the updated probability distribution which minimally diverges from the relatively prior probability distribution in terms of cross-entropy.

As long as constraints of the form (3.3) are consistent, they identify a subset of the simplex $\mathbb{S}^{n-1}$ which is an affine subset. That is why they are called affine constraints. The affine nature of the subset guarantees that the subset is closed and convex (see Paris, 2006, 66).

Example 9: More Than One Half. A coin for which I have excellent evidence that it is fair is about to be flipped. My prior probability that the coin will land tails is 0.5. A demon, whom I inductively know to be reliable about the future chances of events, tells me that on the next flip the probability for tails is greater than 0.5 (see Landes and Williamson, 2013, 3557).

If the constraint is non-affine, as in example 9, we have no such guarantees. Landes and Williamson worry about this as irrationality in an otherwise solid defence of pme (see Landes and Williamson, 2013; for more detail on affine constraints see Csiszár, 1967; Williams, 1980, 137; Howson and Franklin, 1994, 456; Uffink, 1996, 7; Paris, 2006, 66; Debbah and Müller, 2005, 1673; Williamson, 2011, 5; and Paul Penfield’s lecture notes http://www-mtl.mit.edu/Courses/6.050/2013/notes, section 9.4). I do not address this issue here in any more depth.
A comprehensive defence of pme may need to close this gap.

Note that some epistemologists defend a position which requires all affine constraints to be solved by standard conditioning (for example Jan van Campenhout and Thomas Cover, who consider pme a special case of Bayes’ Theorem, see 1981; Skyrms, 1985, which van Fraassen considers “the most sensitive treatment so far” in his 1993, 313; Skyrms, 1986, 237; Howson and Franklin, 1994, 461; and Grove and Halpern, 1997, making this case for the Judy Benjamin problem treated in chapter 7 below).

An affine constraint makes it possible to find points in the subset that are minimally distant from the point corresponding to the prior probability distribution. Naively, we might use the Euclidean distance to determine the minimally distant point (for example in Joyce, 1998; or Leitgeb and Pettigrew, 2010a). Chapter 4 shows that such a procedure would put us at odds with a host of reasonable intuitions. In the next section, I provide a sketch of the information-theoretic approach.

3.2 Information Theory

3.2.1 Shannon Entropy

Information theory is rooted in the consideration of communication over noisy channels. Since most channels transmit information with errors, before Shannon the prevailing thought was that in order to reduce the probability of error, redundancy needed to be increased. No pain, no gain. Shannon’s noisy-channel coding theorem proves this intuition to be false. Information can be communicated over a noisy channel at a non-zero rate with arbitrarily small error probability. The maximum rate for which this is possible is called the capacity of the channel (see MacKay, 2003, 14f).

To prove this remarkable result and calculate the capacity of a channel, Shannon needs a measure of entropy for a probability distribution. Consider the written English language and its distribution of Latin letters: the letter E occurs 9.1% of the time, whereas the letter P only occurs 1.9% of the time.
There is a certain inefficiency in coding a message by giving symbols that contain less information as much space as symbols that contain more. The P contains more information than an E, and in general consonants contain more information than vowels. Codes in which there are fewer of these inefficiencies have higher entropy. Entropy is therefore based on the probability distribution of the code symbols.

Let $X_n$ be an ensemble $(x, A_{X_n}, P_{X_n})$ ($P_{X_n}$ will be abbreviated $P_n$ in the following), where $x$ is the outcome of a random variable, $P_n$ is the probability distribution, and the outcome set $A_{X_n}$ is an alphabet with $n$ letters. We want a measure of entropy $H_n$ that has the following properties (see Khinchin, 1957, for these axioms; I gleaned them from Guiaşu, 1977, 9):

(K1) For any $n$, the function $H_n(P_n(x_1), \ldots, P_n(x_n))$ is a continuous and symmetric function with respect to all of its arguments.

(K2) For every $n$, we have $H_{n+1}(P_n(x_1), \ldots, P_n(x_n), 0) = H_n(P_n(x_1), \ldots, P_n(x_n))$.

(K3) For every $n$, we have the inequality $H_n(P_n(x_1), \ldots, P_n(x_n)) \le H_n(1/n, \ldots, 1/n)$.

(K4) If the conditions in (3.11) hold for $\pi_{kl}$, the consequent in (3.12) is true.

For K4, the conditions are

$$\pi_{kl} \ge 0, \quad \sum_{k=1}^{n}\sum_{l=1}^{m_k} \pi_{kl} = 1, \quad P_n(x_k) = \sum_{l=1}^{m_k} \pi_{kl}, \quad M = \sum_{k=1}^{n} m_k \tag{3.11}$$

and the consequent is

$$H_M(\pi_{11}, \ldots, \pi_{n m_n}) = H_n(P_n(x_1), \ldots, P_n(x_n)) + \sum_{k=1}^{n} P_n(x_k)\, H_{m_k}\!\left(\frac{\pi_{k1}}{P_n(x_k)}, \ldots, \frac{\pi_{k m_k}}{P_n(x_k)}\right). \tag{3.12}$$

K1–3 are self-explanatory. K4 requires that the entropy of a product code is obtained by adding the entropy of one component to the entropy of the other component conditioned by the first one. Given Khinchin’s axioms, theorem 1.1 in Silviu Guiaşu’s book proves that, for some constant $c > 0$,

$$H_n(P_n(x_1), \ldots, P_n(x_n)) = -c \sum_{k=1}^{n} P_n(x_k) \log P_n(x_k). \tag{3.13}$$

Leaving off the subscript $n$ (which is predetermined by the cardinality of the alphabet) and taking $c = 1$,

$$H(P(x_1), \ldots, P(x_n)) = -\sum_{k=1}^{n} P(x_k) \log P(x_k) \tag{3.14}$$

is called the Shannon entropy (see Shannon, 1948).
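The Shannon entropy in (3.14) is easy to compute for finite distributions. The following is a minimal sketch, assuming natural logarithms and the convention that an outcome with probability 0 contributes nothing to the sum; the function name `shannon_entropy` and the `loaded` die are hypothetical and serve only to illustrate K2 and K3.

```python
import math


def shannon_entropy(p):
    """Shannon entropy as in (3.14), with natural logarithms and c = 1.

    Assumption: terms with P(x_k) = 0 contribute 0 (the limit of t log t).
    """
    if any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a finite probability distribution")
    return -sum(x * math.log(x) for x in p if x > 0)


uniform = [1 / 6] * 6                        # an 'honest' die
loaded = [0.05, 0.05, 0.1, 0.1, 0.2, 0.5]    # hypothetical loaded die

print(shannon_entropy(uniform))              # the maximum for n = 6, cf. K3
print(shannon_entropy(loaded))               # smaller than the uniform value
print(shannon_entropy(uniform + [0.0]))      # adding an impossible outcome, cf. K2
```

The first value equals $\log 6$, no distribution over six outcomes exceeds it, and appending an outcome of probability 0 leaves the entropy unchanged.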
3.2.2 Kullback-Leibler Divergence

I could use the Shannon entropy as a first pass to isolate a probability distribution from a candidate set determined by an affine constraint.

original pme: A rational agent accepts for her partial beliefs a probability distribution that satisfies evidential constraints and has maximal Shannon entropy.

The problem with original pme is that it makes no reference to a relatively prior probability distribution. It is therefore inconsistent with the former of the two basic commitments of Bayesians, namely that the partial beliefs of a rational agent (i) depend on a prior probability and (ii) are obtained by standard conditioning if applicable, and it is unsuitable from a Bayesian point of view.

In order to fix this, I need the concept of cross-entropy. Whereas maximizing the Shannon entropy gives us the most efficient code given a set of constraints (if they are affine constraints, this code will be uniquely distributed), minimizing cross-entropy gives us the code with respect to which the relatively prior probability distribution is most efficient. Cross-entropy measures how much information is lost when a message is coded using a code optimized for the relatively prior probability distribution rather than the ‘true’ posterior distribution. When I pick a posterior probability distribution according to the principle of minimum cross-entropy or Infomin, I pick the probability distribution that accords with my constraints and makes the relatively prior distribution look as efficient as possible. This is analogous to Bayesian conditionalization, where I choose a posterior probability distribution which strikes a balance between respecting the prior and respecting the new evidence.

To implement Infomin, I need a measure for cross-entropy.
Again, as in K1–4, I first formulate a set of desiderata for this function that maps two probability distributions $P_n$ and $Q_n$ to the set of real numbers. A caveat here: the cross-entropy is conceptualized as a divergence of $P_n$ from $Q_n$ so that in the logic of updating $Q_n$ is the relatively prior probability distribution and $P_n$ is the posterior. I shall use the generic label $D(P_n, Q_n)$ for this measure in the desiderata. I shall also sometimes write $D(P_n(x_1), \ldots, P_n(x_n); Q_n(x_1), \ldots, Q_n(x_n))$ for $D(P_n, Q_n)$. This time, I gleaned the desiderata from Paris, 2006, 121; the original proof is in Kullback and Leibler, 1951.

(KL1) For each $n$, $D(P_n, Q_n)$ is continuous, provided that $Q(x_i) \neq 0$ if $P(x_i) \neq 0$.

(KL2) For $0 < n \le n_0$, $D(1/n, \ldots, 1/n, 0, \ldots, 0; 1/n_0, \ldots, 1/n_0)$ is strictly decreasing in $n$ and increasing in $n_0$ and is 0 if $n = n_0$.

(KL3) Provided that $Q(x_i) \neq 0$ if $P(x_i) \neq 0$, and given a permutation $\sigma$ of $(1, \ldots, n)$, the consequent (3.15) holds.

(KL4) If the conditions in (3.11) hold for $\pi_{kl}$ and analogous conditions hold for $\varrho_{kl}$ (replace $P$ by $Q$), then the consequent in (3.16) is true, provided that $\varrho_{kl} \neq 0$ if $\pi_{kl} \neq 0$.

For KL3, the consequent is

$$D(P_n(x_1), \ldots, P_n(x_n); Q_n(x_1), \ldots, Q_n(x_n)) = D(P_n(x_{\sigma(1)}), \ldots, P_n(x_{\sigma(n)}); Q_n(x_{\sigma(1)}), \ldots, Q_n(x_{\sigma(n)})). \tag{3.15}$$

For KL4, the consequent is

$$D(\pi_{11}, \ldots, \pi_{n m_n}; \varrho_{11}, \ldots, \varrho_{n m_n}) = D(P_n(x_1), \ldots, P_n(x_n); Q_n(x_1), \ldots, Q_n(x_n)) + \sum_{k=1}^{n} P_n(x_k)\, D\!\left(\frac{\pi_{k1}}{P_n(x_k)}, \ldots, \frac{\pi_{k m_k}}{P_n(x_k)}; \frac{\varrho_{k1}}{Q_n(x_k)}, \ldots, \frac{\varrho_{k m_k}}{Q_n(x_k)}\right). \tag{3.16}$$

These desiderata give us the solution

$$D(P_n, Q_n) = c \sum_{i=1}^{n} P_n(x_i) \log \frac{P_n(x_i)}{Q_n(x_i)} \tag{3.17}$$

which is fulfilled by the Kullback-Leibler divergence $D_{\mathrm{KL}}$, again setting $c = 1$,

$$D_{\mathrm{KL}}(P_n, Q_n) = \sum_{i=1}^{n} P_n(x_i) \log \frac{P_n(x_i)}{Q_n(x_i)}. \tag{3.18}$$

In section 2.2, I have articulated a version of pme to be added to classical Bayesian commitments. pme seeks to be sensitive to the intuition that we ought not to gain information where the additional information is not warranted by the evidence.
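To make the divergence in (3.18) concrete, here is a minimal sketch, assuming natural logarithms. The function name `kl_divergence` and the choice to return infinity outside the proviso of KL1 are assumptions made for illustration. The example reuses Holmes’ numbers from (3.4) and compares the two candidate posteriors with $P'(E_1) = 1/2$ described in the caption of figure 3.2; the coordinates of both candidates were worked out by hand for this sketch.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence as in (3.18) of posterior p from prior q."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue              # a term with P(x_i) = 0 is taken to be 0
        if q_i == 0:
            return math.inf       # assumption: outside the proviso of KL1
        total += p_i * math.log(p_i / q_i)
    return total


prior = [1 / 3, 1 / 2, 1 / 6]            # Holmes' distribution (3.4)
jeffrey = [1 / 2, 3 / 8, 1 / 8]          # rescales P(E2), P(E3) proportionally
euclidean = [1 / 2, 5 / 12, 1 / 12]      # shifts P(E2), P(E3) by equal amounts

print(kl_divergence(prior, prior))       # no update, no information gain
print(kl_divergence(jeffrey, prior))
print(kl_divergence(euclidean, prior))
```

On these numbers the proportional candidate diverges less from the prior than the equal-shift candidate, in line with the claim in the caption of figure 3.2 that Jeffrey conditioning is the update which minimally diverges from the relatively prior distribution.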
Some want to drive a wedge between the synchronic rule to keep the entropy maximal (pme) and the diachronic rule to keep the cross-entropy minimal (Infomin) (for this objection see Walley, 1991, 270f). Here is a brief excursion to dispel this worry.

Example 10: Piecemeal Learning. Consider a bag with blue, red, and green tokens. You know that ($C'$) at least 50% of the tokens are blue. Then you learn that ($C''$) at most 20% of the tokens are red.

The synchronic norm, on the one hand, ignores the diachronic dimension and prescribes the probability distribution which has the maximum entropy and obeys both ($C'$) and ($C''$). The three-dimensional vector containing the probabilities for blue, red, and green is $(1/2, 1/5, 3/10)$. The diachronic norm, on the other hand, processes ($C'$) and ($C''$) sequentially, taking in its second step $(1/2, 1/4, 1/4)$ as its prior probability distribution and then diachronically updating to $(8/15, 1/5, 4/15)$.

The information provided in a problem calling for the synchronic norm and the information provided in a problem calling for the diachronic norm are different, as temporal relations and their implications for dependence between variables clearly matter. In example 10, I might have relevantly received information ($C''$) before ($C'$) (‘before’ may be understood logically rather than temporally) so that Infomin updates in its last step $(2/5, 1/5, 2/5)$ to $(1/2, 1/6, 1/3)$. Even if ($C'$) and ($C''$) are received in a definite order, the problem may be phrased in a way that indicates independence between the two constraints. In this case, the synchronic norm is the appropriate norm to use. Infomin does not assume such independence and therefore processes the two pieces of information separately. Disagreement arises when observations are interpreted differently, not because pme and Infomin are inconsistent with each other.

The fault line for commutativity is actually located between standard conditioning (which is always commutative) and Jeffrey conditioning (which is sometimes non-commutative). Huttegger explains:

This has caused some concerns as to whether probability kinematics is a rational way to update beliefs (Döring, 1999; Lange, 2000; and Kelly, 2008). I agree with Wagner and Joyce that these concerns are misguided (see Wagner, 2002; and Joyce, 2009). As Joyce points out, probability kinematics is non-commutative exactly when it should be; namely, when belief revision destroys information obtained in previous updates. (Huttegger, 2015, 628.)

In the following, I will assume that pme and Infomin are compatible and part of the toolkit at the disposal of pme.

The first question to ask is if pme is compatible with the Bayesian commitment to standard conditioning. The answer is yes. Better yet, pme also agrees with Jeffrey’s extension to standard conditioning, now commonly called Jeffrey conditioning.

A proof that pme generalizes standard conditioning is in Williams, 1980. A proof that pme generalizes Jeffrey conditioning is in Jeffrey, 1965. I will give my own simple proofs here that are more in keeping with the notation in the body of the dissertation. An interested reader can also apply these proofs to show that pme generalizes Wagner conditioning, but not without simplifications that compromise mathematical rigour. The more rigorous proof for the generalization of Wagner conditioning is in section 5.4.

I assume finite (and therefore discrete) probability distributions.
For countable and continuous probability distributions, the reasoning is largely analogous (for an introduction to continuous entropy see Guiaşu, 1977, 16ff; for an example of how to do a proof of this section for continuous probability densities see Caticha and Giffin, 2006, and Jaynes, 1978; for a proof that the stationary points of the Lagrange function are indeed the desired extrema see Zubarev et al., 1974, 55, and Cover and Thomas, 2006, 410; for the pioneer of the method applied in this section see Jaynes, 1978, 241ff). Before I address standard conditioning in subsection 3.2.4 and Jeffrey conditioning in subsection 3.2.5, I will present Jaynes’ method in the next subsection.

3.2.3 Constraint Rule for Cross-Entropy

This subsection provides a general solution for probability kinematics using affine constraints, including standard conditioning and Jeffrey conditioning. Jaynes provides such a solution for the maximum entropy problem, best exemplified by the Brandeis Dice problem (see Jaynes, 1989, 243, and example 8). Jaynes’ solution will not serve us because I have restricted myself to updating from relative priors rather than ignorance priors. Rather than maximizing the Shannon entropy, I will give the prescription for minimizing the Kullback-Leibler divergence. Jaynes’ constraint rule can still be recovered as the special case in which the relative priors are equiprobable.

The rule is general and therefore abstract. I will make it more concrete by showing how it applies to the Judy Benjamin problem, which is described in more detail in section 7.1, and the story is provided in example 38. In brief, Judy Benjamin’s relatively prior probabilities are $P(A_1) = 0.25$, $P(A_2) = 0.25$, $P(A_3) = 0.5$. For her posterior probabilities, the affine constraint is that $P'(A_2) = 3 \cdot P'(A_1)$. In this subsection, I will abbreviate $P(A_i) = p_i$ and $P'(A_i) = q_i$.
In terms of the model introduced in section 3.1, $k = 1$, $b_1 = 1$ and $C = (4, 0, 1)^\top$ (this is not the only model; $k = 1$, $b_1 = 0$ and $C = (-3, 1, 0)^\top$ works as well). Thus, using the Brandeis Dice problem as a template, the Judy Benjamin problem is equivalent to a dice problem where the relative prior probabilities for a three-sided die are $(0.25, 0.25, 0.5)$, the sides have 4 dots, 0 dots, and 1 dot respectively, and the average number of spots up in a long-running experiment is not 1.5 but 1 instead.

Let $q_i$ be a posterior probability distribution on a finite space that fulfills the constraint

$$\sum_{i=1}^{n} c_i q_i = b. \tag{3.19}$$

Because $q_i$ is a probability distribution it fulfills

$$\sum_{i=1}^{n} q_i = 1. \tag{3.20}$$

I want to minimize the Kullback-Leibler divergence from $P$ to $P'$, given the constraints (3.19) and (3.20),

$$\sum_{i=1}^{n} q_i \log \frac{q_i}{p_i}. \tag{3.21}$$

I use Lagrange multipliers to define

$$J(q) = \sum_{i=1}^{n} q_i \log \frac{q_i}{p_i} + \lambda_0 \sum_{i=1}^{n} q_i + \lambda_1 \sum_{i=1}^{n} c_i q_i \tag{3.22}$$

and differentiate it with respect to $q_i$

$$\frac{\partial J}{\partial q_i} = \log q_i - \log p_i + 1 + \lambda_0 + \lambda_1 c_i. \tag{3.23}$$

Set (3.23) to 0 to find the necessary condition to minimize (3.21).

$$q_i = e^{\log p_i - \lambda_0 - 1 - \lambda_1 c_i} \tag{3.24}$$

This is a combination of the Boltzmann distribution and the weight of the prior probabilities supplied by $\log p_i$. I still need to do two things: (a) show that the Kullback-Leibler divergence is minimal, and (b) show how to find $\lambda_0$ and $\lambda_1$. (a) is shown in Theorem 12.1.1 in Cover and Thomas (2006) and there is no reason to copy it here.

For (b), let

$$\beta = -\lambda_1 \tag{3.25}$$

$$Z(\beta) = \sum_{i=1}^{n} p_i e^{\beta c_i} \tag{3.26}$$

$$\lambda_0 = \log(Z(\beta)) - 1 \tag{3.27}$$

To find $\lambda_0$ and $\lambda_1$, introduce the constraint

$$\frac{\partial}{\partial \beta} \log(Z(\beta)) = b. \tag{3.28}$$

For the Judy Benjamin problem, substituting $x = e^{\beta}$ and filling in the known variables ($Z(\beta) = 0.25x^4 + 0.25 + 0.5x$ and $b = 1$) yields the equation

$$x^4 = \frac{1}{3} \tag{3.29}$$

with $\beta = \log x$. Use (3.25) and (3.27) to calculate $\lambda_1 = 0.27465$ and $\lambda_0 = -1.3379$. Now use (3.24) to calculate the posterior probabilities $q_1 = 0.117$, $q_2 = 0.351$, $q_3 = 0.533$.
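The steps (3.24) to (3.28) can be checked numerically. The following is a minimal sketch, not the method used in the text: it solves (3.28) by bisection instead of in closed form, and the function name `constraint_rule`, the search interval, and the tolerance are assumptions chosen for illustration.

```python
import math


def constraint_rule(prior, c, b, lo=-50.0, hi=50.0, tol=1e-12):
    """Posterior q_i = p_i * exp(beta * c_i) / Z(beta), cf. (3.24)-(3.27),
    with beta chosen so that sum_i c_i q_i = b, cf. (3.28).

    Assumption: the root of (3.28) lies in [lo, hi]; it is found by bisection.
    """
    def weights(beta):
        return [p * math.exp(beta * c_i) for p, c_i in zip(prior, c)]

    def mean(beta):
        # d/d(beta) log Z(beta): the expected value of c under the tilted prior
        w = weights(beta)
        return sum(c_i * w_i for c_i, w_i in zip(c, w)) / sum(w)

    if not mean(lo) <= b <= mean(hi):
        raise ValueError("constraint not attainable in the search interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < b:         # mean is non-decreasing in beta
            lo = mid
        else:
            hi = mid
    beta = (lo + hi) / 2
    w = weights(beta)
    z = sum(w)
    return [w_i / z for w_i in w], beta, z


# Judy Benjamin: prior (0.25, 0.25, 0.5), C = (4, 0, 1), b = 1
q, beta, z = constraint_rule([0.25, 0.25, 0.5], [4, 0, 1], 1.0)
lambda_1 = -beta                  # (3.25)
lambda_0 = math.log(z) - 1        # (3.27)
print([round(q_i, 3) for q_i in q])
print(round(lambda_1, 5), round(lambda_0, 4))
```

The numerical root is $\beta = -(\log 3)/4$, so that $\lambda_1 = -\beta \approx 0.27465$, and (3.27) gives $\lambda_0 = \log Z(\beta) - 1 \approx -1.3379$. The resulting posterior satisfies $q_2 = 3 q_1$, as the Judy Benjamin constraint requires.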
These posterior probabilities agree with the direct method used in section 7.2 and the result obtained in (7.3).

Although it is good to know that all is well numerically, I am interested in a general proof that this choice of $\lambda_0$ and $\lambda_1$ will give me the probability distribution minimizing the cross-entropy. That $q$ so defined minimizes the cross-entropy is shown in (a). I need to make sure, however, that with this choice of $\lambda_0$ and $\lambda_1$ the constraints (3.19) and (3.20) are also fulfilled.

First,

$$\sum_{i=1}^{n} q_i = \sum_{i=1}^{n} e^{\log p_i - \lambda_0 - 1 - \lambda_1 c_i} = e^{-\lambda_0 - 1} \sum_{i=1}^{n} p_i e^{-\lambda_1 c_i} = e^{-\log(Z(\beta))} Z(\beta) = 1. \tag{3.30}$$

Then, using

$$b \sum_{i=1}^{n} p_i e^{\beta c_i} = \sum_{i=1}^{n} p_i c_i e^{\beta c_i} \tag{3.31}$$

on account of (3.28),

$$\sum_{i=1}^{n} c_i q_i = \sum_{i=1}^{n} c_i e^{\log p_i - \lambda_0 - 1 - \lambda_1 c_i} = \frac{1}{Z(\beta)} \sum_{i=1}^{n} c_i p_i e^{\beta c_i} = \frac{1}{Z(\beta)}\, b \sum_{i=1}^{n} p_i e^{\beta c_i} = b. \tag{3.32}$$

A proof with similar intentions as this subsection is in Paul Penfield’s lecture notes http://www-mtl.mit.edu/Courses/6.050/2013/notes, subsection 9.6.3. Exercise 12.2 (see Cover and Thomas, 2006, 409) is nicely suggestive of the combination between weight of prior probabilities and Jaynes’ constraint manifest in (3.24).

3.2.4 Standard Conditioning

Let $y_i$ (all $y_i \neq 0$) be a finite relatively prior probability distribution summing to 1, $i \in I$. Let $\hat{y}_i$ be the posterior probability distribution derived from standard conditioning with $\hat{y}_i = 0$ for all $i \in I'$ and $\hat{y}_i \neq 0$ for all $i \in I''$, $I' \cup I'' = I$. $I'$ and $I''$ specify the standard event observation. Standard conditioning requires that

$$\hat{y}_i = \frac{y_i}{\sum_{k \in I''} y_k}. \tag{3.33}$$

To solve this problem using pme, I want to minimize the cross-entropy with the constraint that the non-zero $\hat{y}_i$ sum to 1. The Lagrange function is (writing in vector form $\hat{y} = (\hat{y}_i)_{i \in I''}$)

$$\Lambda(\hat{y}, \lambda) = \sum_{i \in I''} \hat{y}_i \ln \frac{\hat{y}_i}{y_i} + \lambda \left(1 - \sum_{i \in I''} \hat{y}_i\right). \tag{3.34}$$

Differentiating the Lagrange function with respect to $\hat{y}_i$ and setting the result to zero gives us

$$\hat{y}_i = y_i e^{\lambda - 1} \tag{3.35}$$

with $\lambda$ normalized to

$$\lambda = 1 - \ln \sum_{i \in I''} y_i. \tag{3.36}$$

(3.33) follows immediately.
pme generalizes standard conditioning.

3.2.5 Jeffrey Conditioning

For this subsection, I am using a special notation that I develop in section 5.4. Even though thematically this subsection belongs here, it depends on understanding the notation introduced in chapter 5. The reader may want to skip this subsection and return to it later. The conclusion stands that pme generalizes Jeffrey conditioning.

Let $\theta_i$, $i = 1, \ldots, n$ and $\omega_j$, $j = 1, \ldots, m$ be finite partitions of the event space with the joint prior probability matrix $(y_{ij})$ (all $y_{ij} \neq 0$). Let $\kappa$ be defined as in section 5.4, with (5.11) true (noting that in section 5.4, (5.11) is no longer required). Let $P$ be the relatively prior probability distribution and $\hat{P}$ the posterior probability distribution.

Let $\hat{y}_{ij}$ be the posterior probability distribution derived from Jeffrey conditioning with

$$\sum_{i=1}^{n} \hat{y}_{ij} = \hat{P}(\omega_j) \quad \text{for all } j = 1, \ldots, m. \tag{3.37}$$

Jeffrey conditioning requires that for all $i = 1, \ldots, n$

$$\hat{P}(\theta_i) = \sum_{j=1}^{m} P(\theta_i|\omega_j)\, \hat{P}(\omega_j) = \sum_{j=1}^{m} \frac{y_{ij}}{P(\omega_j)}\, \hat{P}(\omega_j). \tag{3.38}$$

Using pme to get the posterior distribution $(\hat{y}_{ij})$, the Lagrange function is (writing in vector form $\hat{y} = (\hat{y}_{11}, \ldots, \hat{y}_{n1}, \ldots, \hat{y}_{nm})^\top$ and $\lambda = (\lambda_1, \ldots, \lambda_m)^\top$)

$$\Lambda(\hat{y}, \lambda) = \sum_{i=1}^{n} \sum_{j=1}^{m} \hat{y}_{ij} \ln \frac{\hat{y}_{ij}}{y_{ij}} + \sum_{j=1}^{m} \lambda_j \left(\hat{P}(\omega_j) - \sum_{i=1}^{n} \hat{y}_{ij}\right). \tag{3.39}$$

Consequently,

$$\hat{y}_{ij} = y_{ij} e^{\lambda_j - 1} \tag{3.40}$$

with the Lagrangian parameters $\lambda_j$ normalized by

$$\sum_{i=1}^{n} y_{ij} e^{\lambda_j - 1} = \hat{P}(\omega_j). \tag{3.41}$$

(3.38) follows immediately, since $\sum_{i=1}^{n} y_{ij} = P(\omega_j)$. pme generalizes Jeffrey conditioning.

3.3 Pragmatic, Axiomatic, and Epistemic Justification

3.3.1 Three Approaches

There are various ways in which an explanatory theory of a rational agent’s partial beliefs can be justified. The rational agent is a relevant idealization, relevant in the sense that there may be a different explanatory theory of an actual effective human reasoner’s partial beliefs.
The rational agent models an inductive logic, not dissimilar to the inference relationships that model deductive logic and may also be relevant idealizations of effective human deductive reasoning.

In this dissertation, I am assuming that a rational agent already has quantitative partial beliefs (also called credences) about certain propositions and that she reasons as a Bayesian, i.e. her credences are probabilistic (obey Kolmogorov’s axioms) and she updates them using standard conditioning whenever standard conditioning is applicable. This assumption is controversial. There are a few places in this dissertation where I provide some justification for it, but the focus is the next step once the Bayesian elements are already in place.

In later chapters, I shall defend the thesis that rational agents (i) have sharp credences and (ii) update them in accordance with norms based on and in agreement with information theory. Neither sharp credences nor information-based update is a necessary feature of recent Bayesian epistemology. On the contrary, most Bayesians now appear to prefer indeterminate credal states and an eclectic approach to updating. This section sets the stage by examining what a defence of my claims needs to do in order to be successful.

Partial belief epistemologists have varying preferences about what justifies their theories. For ease of exposition, I am condensing them to three major approaches: the psychological, the formal, and the philosophical. My hope is that a defence of (i) and (ii) succeeds on all three levels. Since all three approaches have explanatory virtue, one would also hope that they converge and settle on compatible solutions to the question of which partial beliefs rational agents entertain. (Jürgen Landes and Jon Williamson similarly show how three norms converge in justifying pme: a probability norm, a calibration norm, and an equivocation norm, see Landes and Williamson, 2013.)
Information theory and its virtues across all three levels support this convergence.

I now turn to a more detailed description of what I mean by the psychological, the formal, and the philosophical approaches, which I cash out as providing pragmatic, axiomatic, and epistemic justification for the partial beliefs of a rational agent.

3.3.2 The Psychological Approach

The psychological approach justifies partial beliefs by highlighting their pragmatic virtues. Frank Ramsey and Bruno de Finetti were the first to substantiate this approach (see Ramsey, 1926; and de Finetti, 1931). A system of partial beliefs corresponds to a willingness to make decisions in the light of these beliefs, most helpfully to purchase a bet on a proposition $p$ costing \$$x$ with a return of \$1 if the proposition turns out to be true and no return if the proposition turns out to be false.

Both synchronic and diachronic Dutch-book arguments demonstrate the constraints that rational agents have with respect to their partial beliefs lest they fall prey to arbitrage opportunities. The most vocal critics of the synchronic argument are Colin Howson and Peter Urbach (loc. cit., but see also Earman, 1992, 38f). The diachronic arguments, however, are significantly more controversial and less supported in the literature than the synchronic arguments originally formulated by de Finetti (for the diachronic argument supporting standard conditioning see Armendt, 1980; and Lewis, 2010; for criticism see Howson and Franklin, 1994, 458). Jon Williamson, interestingly, thinks that the diachronic argument fails because it supports Bayesian conditionalization against pme, with which it is incompatible and to which it must cede (see Williamson, 2011). The idea that Bayesian conditionalization and pme conflict can be found in other places as well (see Seidenfeld, 1979, 432f; Shimony, 1985; van Fraassen, 1993, 288ff; Uffink, 1995, 14; Howson and Urbach, 2006, 278; and Neapolitan and Jiang, 2014, 4012), but the usual interpretation is that the conflict undermines pme. This dissertation defends the claim that they do not conflict at all.

3.3.3 The Formal Approach

The formal approach justifies partial beliefs by providing an axiomatic system which coheres with intuitions we have about partial beliefs and limits the partial belief functions fulfilling the formal constraints. The pioneer of this method is Richard Cox in his landmark article “Probability, Frequency and Reasonable Expectation.” The idea is simple and can be surprisingly powerful: formalize reasonable expectations and apply them to yield unique solutions, so that for example only partial beliefs fulfilling Bayesian requirements also fulfill the reasonable expectations. The idea is controversial, especially when it leads to unique solutions.

It is often possible to weaken the assumptions such that families of solutions, rather than unique solutions, fulfill the axioms. This is the case for Carnap’s continuum of inductive methods (see Carnap, 1952) and the Rényi entropy, which generalizes Shannon’s entropy and yields families of credence functions fulfilling maximum entropy requirements (see Uffink, 1995, who criticizes and weakens the assumptions of Shore and Johnson, 1980, a paradigmatic example for the formal approach supporting pme; and Huisman, 2014, who uses this idea to solve the Judy Benjamin problem in chapter 7 with indeterminate credal states).

There is a sense in which both the psychological and the philosophical approach are formal approaches as well, starting with a set of axioms and identifying the credence functions that fulfill them. I am setting the formal approach apart when it appears that the uniqueness result is kept in mind as a target in formulating the axioms.
Often, there is resistance to such a manipulation of the initial conditions in order to achieve the final result of uniqueness (see the above comment on weakening of the axioms). The formal approach, however, considers uniqueness itself a desired outcome with justificatory force. There is something at least interesting about the set of axioms that gives us a unique solution, even though each axiom still needs to pass the rigorous test of plausibility compared to its alternatives.

For Ramsey and de Finetti, the salient axiom is invulnerability to arbitrage opportunities and does decidedly not yield uniqueness. Subjective probabilities can range wildly between rational agents, as long as they adhere to the laws of probability. Bas van Fraassen argues that calibration requirements mandate the use of standard conditioning once a relatively prior probability distribution is in place (see van Fraassen, 1989). The calibration requirement is both psychological and formal in the sense that it is based on pragmatic considerations about an agent’s psychology and behaviour, especially her need to have beliefs and states of the world calibrated, but also on a formal apparatus that gives us a unique result for conditioning on new information. Joyce provides the philosophical counterpart (see Joyce, 1998), where the salient axiom is gradational accuracy (which is philosophical rather than psychological in its nature) and again standard conditioning follows as the uniquely mandated updating method fulfilling the axiom.

In a Jeffrey-type updating scenario, where standard conditioning cannot be applied, there are similar attempts to isolate a unique updating method.
This method is usually Jeffrey conditioning, for which there are dynamic coherence arguments (see Armendt, 1980; Goldstein, 1983; and Skyrms, 1986), met with critical resistance in the literature (see Levi, 1987; Christensen, 1999; Talbott, 1991; Maher, 1992; and Howson and Urbach, 2006). Remarkably, Joyce’s norm of gradational accuracy mandates a different unique updating method in Jeffrey-type updating scenarios, LP conditioning (see Leitgeb and Pettigrew, 2010b). I will look at this anomaly in great detail in chapter 4.

Sometimes neither standard conditioning nor Jeffrey conditioning can be applied, even though the constraint is affine. In this case, I can develop a justification of pme using the axiomatic approach (see Shore and Johnson, 1980; Tikochinsky et al., 1984; and Skilling, 1988). Criticism of pme focuses either on the strength of the assumptions (and weakening them, as in Uffink, 1996) or on deductive implications of the axioms that are inconsistent with epistemic intuitions we have about particular cases (the paradigm case is the Judy Benjamin problem, see van Fraassen, 1981).

The following quotes illustrate the skepticism entertained by many about pme as a general updating procedure for affine constraints:

There has been considerable interest recently in maximum entropy methods, especially in the philosophical literature. Example 5.1 suggests that any claims to the effect that maximum-entropy revision is the only correct route to probability revision should be viewed with considerable caution because of its strong dependence on the measure of closeness being used. (Diaconis and Zabell, 1982, 829.)

It will come as no surprise to those who have studied the relation of maxent to conditionalization in a larger space that there are many strategies which conflict with maxent and yet satisfy these conditions for coherence. (Skyrms, 1986, 241.)

None of the arguments for the PME, when regarded as a general method for generating precise probabilities, is at all compelling. (Walley, 1991, 271.)

The fact that Jeffrey’s rule coincides with maxent is simply a misleading fluke, put in its proper perspective by the natural generalization of Jeffrey conditionalization described in this paper. (Wagner, 1992, 255.)

We saw that simple conditionalization is actually the rule enjoined by the principle of minimum information in such circumstances. Furthermore, not to use the rule of Bayesian conditionalization, but some other rule, like the principle of minimum information with a uniform prior and constraints in the form of expectation values, actually entails inconsistency, i.e. incoherence. (Howson and Franklin, 1994, 465.)

Maximum entropy is also sometimes proposed as a method for solving inference problems . . . I think it is a bad idea to use maximum entropy in this way; it can give very silly answers. (MacKay, 2003, 308.)

Maximum entropy and relative entropy have proved quite successful in a number of applications, from physics to natural-language modeling. Unfortunately, they also exhibit some counterintuitive behavior on certain applications. Although they are valuable tools, they should be used with care. (Halpern, 2003, 110.)

[pme] essentially never gives the right results . . . it is likely to give highly misleading answers. (Grünwald and Halpern, 2003, 243.)

Given the other problematic features of conditionalization we pointed to in Chapter 3, we feel that in linking its fortunes to the principle of minimum information no real advance has been made in justifying its adoption as an independent Bayesian principle. (Howson and Urbach, 2006, 287f.)

It is notoriously difficult to defend general procedures for directly updating credences on constraints. (Moss, 2013, 7.)

My dissertation disagrees with these assessments.
The axioms which require a unique updated probability distribution that is minimally informative with respect to the prior probability distribution are defensible and do not lead to incoherence.

3.3.4 The Philosophical Approach

The philosophical approach justifies partial beliefs by highlighting epistemic virtues of partial beliefs, contrasted usually with pragmatic virtues. The idea is that there is independent appeal in having doxastic states that are as close as possible to the truth, no matter what their pragmatic consequences are. The manifesto of this approach is James Joyce’s article “A Nonpragmatic Vindication of Probabilism.” In full belief epistemology, epistemic virtue consists in the balance between believing as many truths and disbelieving as many falsehoods as possible. A trade-off may be involved between valuing a low error ratio and valuing a high number of beliefs (on which one can act, for example, and be better safe than sorry, even when the belief is false, see Stephens, 2001).

The point of this approach, however, is that pragmatic considerations take the back seat and give priority to a rational agent’s desire to get as close as possible to reflecting the state of the world in her epistemic state. Joyce’s Norm of Gradational Accuracy, which generalizes the full belief norm of accuracy to partial beliefs, succeeds in affording us a system of requirements with substantial epistemic implications, such as probabilism, Bayes’ formula, perhaps a principle of indifference, and other forms of conditioning (see Greaves and Wallace, 2006; and Leitgeb and Pettigrew, 2010a).

As different as the three approaches are, they intersect on a significant amount of common terrain. Rational agents are subject to norms in their partial beliefs because we have intuitions about deficient partial beliefs. It is not the laws of logic which circumscribe these norms.
The disagreement is often about how far intuition can take us in this type of circumscription. If we go too far, there is a danger of counter-examples to overly narrow norms or even inconsistencies between the norms. Often, epistemologists feel that one should only go as far as necessary, with an increasingly heavy burden of proof as indeterminacies give way to determinate solutions. If we do not go far enough, weak norms license partial beliefs that are counter-intuitive to some. Much of the debate centres around striking an intuitively plausible balance between these two impulses.

This dissertation defends an approach to norms for the partial beliefs of a rational agent that many may find constricting. In scientific practice, however, there may be an advantage to a relatively specific normative theory of partial beliefs. The scientist does not need to consult the philosopher on questions of updating procedures, for example, if the simple intuitions of information theory reliably work with their established formal methods. I will address this advantage again when I talk about the full employment theorem (see section 7.1).

There are examples from the history of partial belief norms which reflect the tension between the relatively liberal and relatively conservative approach. Carnap’s continuum of inductive methods, even though as a continuum with variable lambda it pays homage to liberalism (which E.T. Jaynes subsequently criticized by fixing λ = 2, see Jaynes and Bretthorst, 2003), meets resistance primarily from those who consider Carnap’s assumptions too strong and his conclusions therefore too narrow. Bayesians are constantly confronted with examples where standard conditioning is supposed to give the wrong answer (see Howson and Urbach, 2006, 81; or Williamson, 2011).

At the foundation of my work is the intuition that allowing violations of information theory’s principles saddles us with “absurdities that qualify as rational” (quoting from Salmon, 1967, 81, in a different context), although these absurdities are often anything but plain. Whether or not this intuition is defensible rests on the confirmation and disconfirmation gleaned from counter-examples, conceptual questions, and the integrity and power of formal accounts.

I have so far used the term ‘intuition’ without much reflection on what its role is in the justification of epistemic norms. There is a sense in which this dissertation is subject to a certain tension with respect to intuitions: on the one hand, I express skepticism about the intuitions on which the full employment theorem relies when it gives primacy to the epistemologist’s intuitions about particular cases over general updating procedures. On the other hand, I repeatedly take recourse to intuitions to defend my own approach.

There are two types of intuitions at play. The first type concerns our views about the plausibility of rules and their implications for particular cases. The second type is restricted to particular cases and by definition escapes the general rule. My argument is skeptical towards the second type. The first type is what gets our thinking about the justification of epistemic norms off the ground in so far as these norms are ampliative with respect to logical coherence. First type intuitions are as necessary in epistemic justification as axioms are in mathematics, but they also share with axioms that it is a fruitful endeavour to consider varying sets of intuitions, what their logical consequences are, and how these consequences square with other intuitions we might have.

Chapter 4

Asymmetry and the Geometry of Reason

4.1 Contours of a Problem

In the early 1970s, the dominant models for similarity in the psychological literature were all geometric in nature. Distance measures capturing similarity and dissimilarity between concepts obeyed minimality, symmetry, and the triangle inequality.
Then Amos Tversky wrote a compelling paper undermining the idea that a metric topology is the best model (see Tversky, 1977). Tversky gave both theoretical and empirical reasons why similarity between concepts fulfills neither minimality, nor symmetry, nor the triangle inequality. Geometry with its metric distance measures was in some ways not a useful model of similarity. Tversky presented an alternative set-theoretic account which accommodated intuitions that could not be reconciled with a geometry of similarity.

The aim of this chapter is to help along a similar paradigm shift when it comes to epistemic modeling of closeness or difference between subjective probability distributions. The ‘geometry of reason’ (a term coined by Richard Pettigrew and Hannes Leitgeb, two of its advocates) violates reasonable expectations for an acceptable model. A non-metric alternative, information theory, fulfills many of these expectations but violates others which are similarly intuitive. Instead of presenting a third alternative which coheres better with the list of expectations outlined in section 4.4, I defend the view that while the violations of the geometry of reason are irremediable, there is promise that an advanced formal account of information theory, using the theory of differential manifolds, can explain information theory’s violations of prima facie reasonable expectations.

The geometry of reason refers to a view of epistemic utility in which the underlying topology for credence functions (which may be subjective probability distributions) on a finite number of events is a metric space. The set of non-negative credences that an agent assigns to the outcome of a die roll, for example, is isomorphic to $\mathbb{R}^6_{\geq 0}$. If the agent fulfills the requirements of probabilism, the isomorphism is to the narrower set $S^5$, the five-dimensional simplex for which

\[ p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1. \tag{4.1} \]

For the remainder of this chapter I will assume probabilism and an isomorphism between probability distributions $P$ on an outcome space $\Omega$ with $|\Omega| = n$ and points $p \in S^{n-1} \subset \mathbb{R}^n$ having coordinates $p_i = P(\omega_i)$, $i = 1, \ldots, n$ and $\omega_i \in \Omega$. Since the isomorphism is to a metric space, there is a distance relation between credence functions which can be used to formulate axioms relating credences to epistemic utility and to justify or to criticize contentious positions such as Bayesian conditionalization, the principle of indifference, other forms of conditioning, or probabilism itself (see especially works cited below by James Joyce; Pettigrew and Leitgeb; David Wallace and Hilary Greaves). For information theory, as opposed to the geometry of reason, the underlying topology for credence functions is not a metric space (see figures 4.1 and 4.2 for illustration).

Epistemic utility in Bayesian epistemology has attracted some attention in the past few years. Patrick Maher provides a compelling acceptance-based account of epistemic utility (see Maher, 1993, 182–207). Joyce, in “A Nonpragmatic Vindication of Probabilism,” defends probabilism supported by partial-belief-based epistemic utility rather than the pragmatic utility common in Dutch-book style arguments (see Joyce, 1998). For Joyce, norms of gradational accuracy characterize the epistemic utility approach to partial beliefs, analogous to norms of truth for full beliefs.

Wallace and Greaves investigate epistemic utility functions along ‘stability’ lines and conclude that for everywhere stable utility functions standard conditioning is optimal, while only somewhere stable utility functions create problems for maximizing expected epistemic utility norms (see Greaves and Wallace, 2006; and Pettigrew, 2013). Richard Pettigrew and Hannes Leitgeb have published arguments that under certain assumptions probabilism and standard conditioning (which together give epistemology a distinct Bayesian flavour) minimize inaccuracy, thereby providing maximal epistemic utility (see Leitgeb and Pettigrew, 2010a and 2010b).

[Figure 4.1: The simplex $S^2$ in three-dimensional space $\mathbb{R}^3$ (axes x, y, z) with contour lines corresponding to the geometry of reason around point A in equation (4.12). Points on the same contour line are equidistant from A with respect to the Euclidean metric. Compare the contour lines here to figure 4.2. Note that this diagram and all the following diagrams are frontal views of the simplex.]

Leitgeb and Pettigrew show, given the geometry of reason and other axioms inspired by Joyce (for example normality and dominance), that in order to avoid epistemic dilemmas we must commit ourselves to a Brier score measure of inaccuracy and subsequently to probabilism and standard conditioning. The Brier score is the mean squared error of a probabilistic forecast. For example, if we look at 100 days for which the forecast was 30% rain and the incidence of rain was 32 days, then the Brier score is

\[ \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2 = \frac{1}{100}\left(32 \cdot (0.3 - 1)^2 + 68 \cdot (0.3 - 0)^2\right) = 0.218. \tag{4.2} \]

A Brier score of 0 is a perfect match between forecast and reality in the sense that the forecaster anticipates every instance of rain with a 100% forecast and every instance of no rain with a 0% forecast.

[Figure 4.2: The simplex $S^2$ (axes x, y, z) with contour lines corresponding to information theory around point A in equation (4.12). Points on the same contour line are equidistant from A with respect to the Kullback-Leibler divergence. The contrast to figure 4.1 will become clear in much more detail in the body of the chapter. Note that the contour lines of the geometry of reason are insensitive to the boundaries of the simplex, while the contour lines of information theory reflect them. One of the main arguments in this chapter is that information theory respects epistemic intuitions we have about asymmetry: proximity to extreme beliefs with very high or very low probability influences the topology that is at the basis of updating.]

Jeffrey conditioning (also called probability kinematics) is widely considered to be a commonsense extension of standard conditioning. On Leitgeb and Pettigrew’s account, using the Brier score, it fails to provide maximal epistemic utility. Another type of conditioning, which I will call LP conditioning, takes the place of Jeffrey conditioning. The failure of Jeffrey conditioning to minimize inaccuracy on the basis of the geometry of reason casts, by reductio, doubt on the geometry of reason.

I will show that LP conditioning, which the geometry of reason entails, fails commonsense expectations that are reasonable to have for the kind of updating scenario that LP conditioning addresses. To relate probability distributions to each other geometrically, using the isomorphism between the set of probability distributions on a finite event space $W$ with $|W| = n$ and the $(n-1)$-dimensional simplex $S^{n-1} \subset \mathbb{R}^n$, is initially an arbitrary move. Leitgeb and Pettigrew do little to substantiate a link between the geometry of reason and epistemic utility on a conceptual level. It is the formal success of the model that makes the geometry of reason attractive, but the failure of LP conditioning to meet basic expectations undermines this success.

The question then remains whether we have a plausible candidate to supplant the geometry of reason. The answer is yes: information theory provides us with a measure of closeness between probability distributions on a finite event space that has more conceptual appeal than the geometry of reason, especially with respect to epistemic utility, since it is intuitively correct to relate coming-to-knowledge to exchange of information.
More persuasive than intuition, however, is the fact that information theory supports both standard conditioning (see Williams, 1980) and the extension of standard conditioning to Jeffrey conditioning (see subsection 3.2.5), an extension which is on the one hand intuitive (see Wagner, 2002) and on the other hand formally continuous with the standard conditioning which Leitgeb and Pettigrew have worked so hard to vindicate nonpragmatically. LP conditioning is not continuous with standard conditioning, which is reflected in one of the expectations that LP conditioning fails to meet.

4.2 Epistemic Utility and the Geometry of Reason

4.2.1 Epistemic Utility for Partial Beliefs

There is more epistemic utility for an agent in believing a truth rather than not believing it and in not believing a falsehood rather than believing it. Accuracy in full belief epistemology can be measured by counting four sets, believed truths and falsehoods as well as unbelieved truths and falsehoods, and somehow relating them to each other such that epistemic utility is rewarded and epistemic vice penalized.

Accuracy in partial belief epistemology must take a different shape since as a ‘guess’ all partial non-full beliefs are off the mark so that they need to be appreciated as ‘estimates’ instead. Richard Jeffrey distinguishes between guesses and estimates: a guess fails unless it is on target, whereas an estimate succeeds depending on how close it is to the target.

The gradational accuracy needed for partial belief epistemology is reminiscent of verisimilitude and its associated difficulties in the philosophy of science (see Popper, 1963; Gemes, 2007; and Oddie, 2013). Joyce and Leitgeb/Pettigrew propose various axioms for a measure of gradational accuracy for partial beliefs relying on the geometry of reason, i.e. the idea of geometrical distance between distributions of partial belief expressed in non-negative real numbers.
In Joyce, a metric space for probability distributions is adopted without much reflection. The midpoint between two points, for example, which is freely used by Joyce, assumes symmetry between the end points. The asymmetric divergence measure that I propose as an alternative to the Euclidean distance measure has no meaningful concept of a midpoint.

Leitgeb and Pettigrew muse about alternative geometries, especially non-Euclidean ones. They suspect that these would be based on and in the end reducible to Euclidean geometry, but they do not entertain the idea that they could drop the requirement of a metric topology altogether (for the use of non-Euclidean geodesics in statistical inference see Amari, 1985). Thomas Mormann explicitly warns against the assumption that the metric for a geometry of logic is Euclidean by default: “All too often, we rely on geometric intuitions that are determined by Euclidean prejudices. The geometry of logic, however, does not fit the standard Euclidean metrical framework” (see Mormann, 2005, 433; also Miller, 1984). Mormann concludes in his article “Geometry of Logic and Truth Approximation,”

Logical structures come along with ready-made geometric structures that can be used for matters of truth approximation. Admittedly, these geometric structures differ from those we are accostumed [sic] with, namely, Euclidean ones. Hence, the geometry of logic is not Euclidean geometry. This result should not come as a big surprise. There is no reason to assume that the conceptual spaces we use for representing our theories and their relations have an [sic] Euclidean structure. On the contrary, this would appear to be an improbable coincidence.
(Mormann, 2005, 453.)

4.2.2 Axioms for Epistemic Utility

Leitgeb and Pettigrew present the following salient axioms (see Leitgeb and Pettigrew, 2010a, 219):

Local Normality and Dominance: If $I$ is a legitimate inaccuracy measure, then there is a strictly increasing function $f : \mathbb{R}^+_0 \to \mathbb{R}^+_0$ such that, for any $A \subseteq W$, $w \in W$, and $x \in \mathbb{R}^+_0$,

\[ I(A, w, x) = f\left(|\chi_A(w) - x|\right). \tag{4.3} \]

Global Normality and Dominance: If $G$ is a legitimate global inaccuracy measure, there is a strictly increasing function $g : \mathbb{R}^+_0 \to \mathbb{R}^+_0$ such that, for all worlds $w$ and belief functions $b \in \mathrm{Bel}(W)$,

\[ G(w, b) = g\left(\|w - b_{\mathrm{glo}}\|\right). \tag{4.4} \]

As in Joyce, these axioms are justified on the basis of geometry, but this time more explicitly so:

Normality and Dominance [are] a consequence of taking seriously the talk of inaccuracy as ‘distance’ from the truth, and [they endorse] the geometrical picture provided by Euclidean n-space as the correct clarification of this notion. As explained in section 3.2, the assumption of this geometrical picture is one of the presuppositions of our account, and we do not have much to offer in its defense, except for stressing that we would be equally interested in studying the consequences of minimizing expected inaccuracy in a non-Euclidean framework. But without a doubt, starting with the Euclidean case is a natural thing to do.

Leitgeb and Pettigrew define two notions, local and global inaccuracy, and show that one must adopt a Brier score to measure inaccuracy in order to avoid epistemic dilemmas trying to minimize inaccuracy on both measures. To give the reader an idea what this looks like in detail and for purposes of later exposition, I want to provide some of the formal apparatus. Let $W$ be a set of worlds and $A \subseteq W$ a proposition. Then

\[ I : \mathcal{P}(W) \times W \times \mathbb{R}^+_0 \to \mathbb{R}^+_0 \tag{4.5} \]

is a measure of local inaccuracy such that $I(A, w, x)$ measures the inaccuracy of the degree of credence $x$ with respect to $A$ at world $w$.
Let $\mathrm{Bel}(W)$ be the set of all belief functions (what I have been calling distributions of partial belief). Then

\[ G : W \times \mathrm{Bel}(W) \to \mathbb{R}^+_0 \tag{4.6} \]

is a measure of global inaccuracy such that $G(w, b)$ measures the inaccuracy of a belief function $b$ at a possible world $w$.

Axioms such as normality and dominance guarantee that the only legitimate measures of inaccuracy are Brier scores if one wants to avoid epistemic dilemmas where one receives conflicting advice from the local and the global measures. For local inaccuracy measures, this means that there is $\lambda \in \mathbb{R}^+$ such that

\[ I(A, w, x) = \lambda \left(\chi_A(w) - x\right)^2 \tag{4.7} \]

where $\chi_A$ is the characteristic function of $A$. For global inaccuracy measures, this means that there is $\mu \in \mathbb{R}^+$ such that

\[ G(w, b) = \mu \|w - b\|^2 \tag{4.8} \]

where $w$ and $b$ are represented by vectors and $\|u - v\|$ is the Euclidean distance

\[ \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}. \tag{4.9} \]

I use (4.7) to define expected local inaccuracy of degree of belief $x$ in proposition $A$ by the lights of belief function $b$, with respect to local inaccuracy measure $I$, and over the set $E$ of epistemically possible worlds as follows:

\[ \mathrm{LExp}_b(I, A, E, x) = \sum_{w \in E} b(\{w\})\, I(A, w, x) = \sum_{w \in E} b(\{w\})\, \lambda \left(\chi_A(w) - x\right)^2. \tag{4.10} \]

I use (4.8) to define expected global inaccuracy of belief function $b'$ by the lights of belief function $b$, with respect to global inaccuracy measure $G$, and over the set $E$ of epistemically possible worlds as follows:

\[ \mathrm{GExp}_b(G, E, b') = \sum_{w \in E} b(\{w\})\, G(w, b') = \sum_{w \in E} b(\{w\})\, \mu \|w - b'\|^2. \tag{4.11} \]

To give a flavour of how attached the axioms are to the geometry of reason, here are Joyce’s axioms called Weak Convexity and Symmetry, which he uses to justify probabilism. Note that in terms of notation $I$ for Joyce is global and related to Leitgeb and Pettigrew’s $G$ in (4.6) rather than $I$ in (4.5):

Weak Convexity: Let $m = 0.5b' + 0.5b''$ be the midpoint of the line segment between $b'$ and $b''$.
If $I(b', \omega) = I(b'', \omega)$, then it will always be the case that $I(b', \omega) \geq I(m, \omega)$ with identity only if $b' = b''$.

Symmetry: If $I(b', \omega) = I(b'', \omega)$, then for any $\lambda \in [0, 1]$ one has $I(\lambda b' + (1 - \lambda) b'', \omega) = I((1 - \lambda) b' + \lambda b'', \omega)$.

Joyce advocates for these axioms in Euclidean terms, using justifications such as “the change in belief involved in going from $b'$ to $b''$ has the same direction but a doubly greater magnitude than change involved in going from $b'$ to [the midpoint] $m$” (see Joyce, 1998, 596). In section 4.5.3, I will show that Weak Convexity holds, and Symmetry does not hold, in ‘information geometry,’ the topology generated by the Kullback-Leibler divergence. The term information geometry is due to Imre Csiszár, who considers the Kullback-Leibler divergence a non-commutative (asymmetric) analogue of squared Euclidean distance and derives several results that are intuitive information geometric counterparts of standard results in Euclidean geometry (see chapter 3 of Csiszár and Shields, 2004).

4.2.3 Expectations for Jeffrey-Type Updating Scenarios

Leitgeb and Pettigrew’s work is continuous with Joyce’s work, but goes significantly beyond it. Joyce wants weaker assumptions and would be leery of expected inaccuracies (4.10) and (4.11), as they might presuppose the probabilism that Joyce wants to justify. Leitgeb and Pettigrew investigate not only whether probabilism and standard conditioning follow from gradational accuracy based on the geometry of reason, but also uniform distribution (their term for the claim of objective Bayesians that there is some principle of indifference for ignorance priors) and Jeffrey conditioning. They show that uniform distribution requires additional axioms which are much less plausible than the ones on the basis of which they derive probabilism and standard conditioning (see Leitgeb and Pettigrew, 2010b, 250f); and that Jeffrey conditioning does not fulfill Joyce’s Norm of Gradational Accuracy (see Joyce, 1998, 579) and therefore violates the pursuit of epistemic utility. Leitgeb and Pettigrew provide us with an alternative method of updating for Jeffrey-type updating scenarios, which I will call LP conditioning.

Example 11: Sherlock Holmes. Sherlock Holmes attributes the following probabilities to the propositions $E_i$ that $k_i$ is the culprit in a crime: $P(E_1) = 1/3$, $P(E_2) = 1/2$, $P(E_3) = 1/6$, where $k_1$ is Mr. R., $k_2$ is Ms. S., and $k_3$ is Ms. T. Then Holmes finds some evidence which convinces him that $P'(F^*) = 1/2$, where $F^*$ is the proposition that the culprit is male and $P$ is relatively prior to $P'$. What should be Holmes’ updated probability that Ms. S. is the culprit?

I will look at the recommendations of Jeffrey conditioning and LP conditioning for example 11 in the next section. For now note that LP conditioning violates all of the following plausible expectations in List A for an amujus, short for ‘a method of updating for Jeffrey-type updating scenarios.’ This is List A:

• continuity: An amujus ought to be continuous with standard conditioning as a limiting case.
• regularity: An amujus ought not to assign a posterior probability of 0 to an event which has a positive prior probability and about which the intervening evidence says nothing except that a strictly weaker event has a positive posterior probability.
• levinstein: An amujus ought not to give “extremely unattractive” results in a Levinstein scenario (see Levinstein, 2012, which not only articulates this failed expectation for LP conditioning, but also the previous two).
• invariance: An amujus ought to be partition invariant.
• expansibility: An amujus ought to be insensitive to an expansion of the event space by zero-probability events.
• confirmation: An amujus ought to align with intuitions we have about degrees of confirmation.
• horizon: An amujus ought to exhibit the horizon effect which makes probability distributions which are nearer to extreme probability distributions appear to be closer to each other than they really are.

Jeffrey conditioning and LP conditioning are both an amujus based on a concept of quantitative difference between probability distributions measured as a function on the isomorphic manifold (in my case, an $(n-1)$-dimensional simplex). Evidence appears in the form of a constraint on acceptable probability distributions, and the acceptable probability distribution closest to the original (relatively prior) probability distribution is chosen as its successor. Here is List B, a list of reasonable expectations one may have toward this concept of quantitative difference (we call it a distance function for the geometry of reason and a divergence for information theory). Let $d(p, q)$ express this concept mathematically.

• triangularity: The concept obeys the triangle inequality. If there is an intermediate probability distribution, it will not make the difference smaller: $d(p, r) \leq d(p, q) + d(q, r)$. Buying a pair of shoes is not going to be more expensive than buying the two shoes individually.
• collinear horizon: This expectation is just a more technical restatement of the horizon expectation in the previous list. If $p, p', q, q'$ are collinear with the centre of the simplex $m$ (whose coordinates are $m_i = 1/n$ for all $i$) and an arbitrary but fixed boundary point $\xi \in \partial S^{n-1}$ and $p, p', q, q'$ are all between $m$ and $\xi$ with $\|p' - p\| = \|q' - q\|$ where $p$ is strictly closest to $m$, then $|d(p, p')| < |d(q, q')|$. For an illustration of this expectation see figure 4.3.
The absolute value is added as a feature to accommodate degree of confirmation functions in subsection 4.4.7, which may be negative.
• transitivity of asymmetry: An ordered pair $(p, q)$ of simplex points associated with probability distributions is asymmetrically negative, positive, or balanced, so either $d(p, q) - d(q, p) < 0$ or $d(p, q) - d(q, p) > 0$ or $d(p, q) - d(q, p) = 0$. If $(p, q)$ and $(q, r)$ are asymmetrically positive, $(p, r)$ ought not to be asymmetrically negative. Think of a bicycle route map with different locations at varying altitudes. If it takes 20 minutes to get from A to B but only 15 minutes to get from B to A then (A, B) is asymmetrically positive. If (A, B) and (B, C) are asymmetrically positive, then (A, C) ought not to be asymmetrically negative.

[Figure 4.3: An illustration of conditions (i)–(iii) for collinear horizon in List B, drawn on the simplex (axes x, y, z) with the points $m$, $\xi$, $p$, $p'$, $q$, $q'$ marked. $p, p'$ and $q, q'$ must be equidistant and collinear with $m$ and $\xi$. If $q, q'$ is more peripheral than $p, p'$, then collinear horizon requires that $|d(p, p')| < |d(q, q')|$.]

While the Kullback-Leibler divergence of information theory fulfills all the expectations of List A, save horizon, it fails all the expectations in List B. Obversely, the Euclidean distance of the geometry of reason fulfills all the expectations of List B, save collinear horizon, and fails all the expectations in List A. Information theory has its own axiomatic approach to justifying probabilism and standard conditioning (see Shore and Johnson, 1980). Information theory provides a justification for Jeffrey conditioning and generalizes it (see subsection 3.2.5). All of these virtues stand in contrast to the violations of the expectations in List B.
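Two of the List B expectations, symmetry of the difference measure and triangularity, can be checked numerically. The following is a minimal sketch, not part of the original argument: the three two-outcome distributions `p`, `q`, `r` are hypothetical values chosen only for illustration, and the function names `kl` and `euclid` are my own. The divergence is computed with the natural logarithm and with the weighting distribution as first argument.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence sum_i p_i * ln(p_i / q_i), natural log."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def euclid(p, q):
    """Euclidean distance between two points of the simplex."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Hypothetical distributions over two outcomes (illustration only),
# lined up from the centre of the simplex toward its boundary.
p = (0.5, 0.5)
q = (0.25, 0.75)
r = (0.1, 0.9)

# Symmetry: is d(p, q) equal to d(q, p)?
kl_is_symmetric = math.isclose(kl(p, q), kl(q, p))
euclid_is_symmetric = math.isclose(euclid(p, q), euclid(q, p))

# triangularity: is d(p, r) <= d(p, q) + d(q, r)?
# A tiny tolerance absorbs floating-point error for the collinear Euclidean case.
kl_triangle_holds = kl(p, r) <= kl(p, q) + kl(q, r)
euclid_triangle_holds = euclid(p, r) <= euclid(p, q) + euclid(q, r) + 1e-12

print("KL symmetric:", kl_is_symmetric, "| Euclid symmetric:", euclid_is_symmetric)
print("KL triangle:", kl_triangle_holds, "| Euclid triangle:", euclid_triangle_holds)
```

For these particular points the divergence is asymmetric and the direct step from `p` to `r` costs more than the two intermediate steps combined, whereas the Euclidean distance passes both checks. A single example of this kind illustrates the violations but does not replace the general treatment given later in the chapter.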
The rest of this chapter fills in the details of these violations both for the geometry of reason and information theory, with the conclusion that the case for the geometry of reason is hopeless while the case for information theory is now a major challenge for future research projects.

4.3 Geometry of Reason versus Information Theory

Here is a simple example corresponding to example 11 where the distance of geometry and the divergence of information theory differ. With this difference in mind, I will show how LP conditioning fails the expectations outlined in List A (see page 74). Consider the following three points in three-dimensional space:

\[
a = \left(\tfrac{1}{3}, \tfrac{1}{2}, \tfrac{1}{6}\right) \qquad
b = \left(\tfrac{1}{2}, \tfrac{3}{8}, \tfrac{1}{8}\right) \qquad
c = \left(\tfrac{1}{2}, \tfrac{5}{12}, \tfrac{1}{12}\right)
\tag{4.12}
\]

All three are elements of the simplex $S^2$: their coordinates add up to 1. Thus they represent probability distributions $A, B, C$ over a partition of the event space into three events. The Kullback-Leibler divergence and Euclidean distance give different recommendations with respect to proximity. Note that the Kullback-Leibler divergence defined in (3.18) is non-negative as a consequence of Gibbs' inequality, irrespective of dimension (see MacKay, 2003, sections 2.6 and 2.7). The Euclidean distance $\|B - A\|$ is defined as in equation (4.9). What is remarkable about the three points in (4.12) is that

\[
\|C - A\| \approx 0.204 < \|B - A\| \approx 0.212
\tag{4.13}
\]

and

\[
D_{\mathrm{KL}}(B, A) \approx 0.0589 < D_{\mathrm{KL}}(C, A) \approx 0.0690.
\tag{4.14}
\]

Assuming the global inaccuracy measure GExp presented in (4.8) and $E = W$ (all possible worlds are epistemically accessible),

\[
\mathrm{GExp}_A(C) \approx 0.653 < \mathrm{GExp}_A(B) \approx 0.656.
\tag{4.15}
\]

Figure 4.4: The simplex $S^2$ in three-dimensional space $\mathbb{R}^3$ with points $a, b, c$ as in equation (4.12) representing probability distributions $A, B, C$. Note that geometrically speaking $C$ is closer to $A$ than $B$ is. Using the Kullback-Leibler divergence, however, $B$ is closer to $A$ than $C$ is. The same case modeled with statistical parameters is illustrated in figures 3.1 and 3.2.

Global inaccuracy reflects the Euclidean proximity relation, not the recommendation of information theory. If $A$ corresponds to my prior and my evidence is such that I must change the first coordinate to 1/2 (as in example 11) and nothing stronger, then information theory via the Kullback-Leibler divergence recommends the posterior corresponding to $B$; and the geometry of reason as expounded in Leitgeb and Pettigrew recommends the posterior corresponding to $C$. There are several things going on here that need some explanation.

4.3.1 Evaluating Partial Beliefs in Light of Others

We note that for Leitgeb and Pettigrew, expected global inaccuracy of $b'$ is always evaluated by the lights of another partial belief distribution $b$. This may sound counterintuitive. Should we not evaluate $b'$ by its own lights? It is part of a larger Bayesian commitment that partial belief distributions are not created ex nihilo. They can also not be evaluated for inaccuracy ex nihilo. Leitgeb and Pettigrew say very little about this, but it appears that there is a deeper problem here with the flow of diachronic updating. The classic Bayesian picture is one of moving from a relatively prior probability distribution to a posterior distribution (distinguish relatively prior probability distributions, which precede posterior probability distributions in updating, from absolutely prior probability distributions, which are ignorance priors in the sense that they are not the resulting posteriors of previous updating).
This is nicely captured by standard conditioning, Bayes' formula, and updating on the basis of information theory (Jeffrey conditioning, principle of maximum entropy).

The geometry of reason and notions of accuracy based on it sit uncomfortably with this idea of flow, as the suggestion is that partial belief distributions are evaluated on their accuracy without reference to prior probability distributions: why should the accuracy or epistemic utility of a posterior probability distribution depend on a prior probability distribution which has already been debunked by the evidence? I agree with Leitgeb and Pettigrew that there is no alternative here but to evaluate the posterior by the lights of the prior. Not doing so would saddle us with Carnap's Straight Rule, where priors are dismissed as irrelevant (see Carnap, 1952, 40ff). Yet we shall note that a justification of evaluating a belief function's accuracy by the lights of another belief function is a lot less persuasive than the way Bayesians and information theory integrate prior distributions into forming posterior distributions by virtue of an asymmetric flow of information (see also Shogenji, 2012, who makes a strong case for the influence of prior probabilities on epistemic justification).

4.3.2 LP conditioning and Jeffrey Conditioning

I want to outline how Leitgeb and Pettigrew arrive at posterior probability distributions in Jeffrey-type updating scenarios. I will call their method LP conditioning.

Example 12: Abstract Holmes. Consider a possibility space $W = E_1 \cup E_2 \cup E_3$ (the $E_i$ are sets of states which are pairwise disjoint and whose union is $W$) and a partition $\mathcal{F}$ of $W$ such that $\mathcal{F} = \{F^{*}, F^{**}\} = \{E_1, E_2 \cup E_3\}$.

Let $P$ be the prior probability function on $W$ and $P'$ the posterior. I will keep the notation informal to make this simple, not mathematically precise. Jeffrey-type updating scenarios give us new information on the posterior probabilities of partitions such as $\mathcal{F}$.
In example 12, let

\[
P(E_1) = 1/3 \qquad P(E_2) = 1/2 \qquad P(E_3) = 1/6
\tag{4.16}
\]

and the new evidence constrain $P'$ such that $P'(F^{*}) = 1/2 = P'(F^{**})$.

Jeffrey conditioning works on the following intuition, which elsewhere I have called Jeffrey's updating principle jup (see also Wagner, 2002). The posterior probabilities conditional on the partition elements equal the prior probabilities conditional on the partition elements since we have no information in the evidence that they should have changed. Hence,

\[
\begin{aligned}
P'_{\mathrm{JC}}(E_i) &= P'(E_i|F^{*})P'(F^{*}) + P'(E_i|F^{**})P'(F^{**}) \\
&= P(E_i|F^{*})P'(F^{*}) + P(E_i|F^{**})P'(F^{**})
\end{aligned}
\tag{4.17}
\]

Jeffrey conditioning is controversial (for an introduction to Jeffrey conditioning see Jeffrey, 1965; for its statistical and formal properties see Diaconis and Zabell, 1982; for a pragmatic vindication of Jeffrey conditioning see Armendt, 1980, and Skyrms, 1986; for criticism see Howson and Franklin, 1994). Information theory, however, supports Jeffrey conditioning. Leitgeb and Pettigrew show that Jeffrey conditioning does not in general pick out the minimally inaccurate posterior probability distribution. If the geometry of reason as presented in Leitgeb and Pettigrew is sound, this would constitute a powerful criticism of Jeffrey conditioning. Leitgeb and Pettigrew introduce an alternative to Jeffrey conditioning, which I call LP conditioning. It proceeds as follows for example 12 and in general provides the minimally inaccurate posterior probability distribution in Jeffrey-type updating scenarios.

Solve the following two equations for $x$ and $y$:

\[
\begin{aligned}
P(E_1) + x &= P'(F^{*}) \\
P(E_2) + y + P(E_3) + y &= P'(F^{**})
\end{aligned}
\tag{4.18}
\]

and then set

\[
\begin{aligned}
P'_{\mathrm{LP}}(E_1) &= P(E_1) + x \\
P'_{\mathrm{LP}}(E_2) &= P(E_2) + y \\
P'_{\mathrm{LP}}(E_3) &= P(E_3) + y
\end{aligned}
\tag{4.19}
\]

For the more formal and more general account see Leitgeb and Pettigrew, 2010b, 254. The results for example 12 are:

\[
P'_{\mathrm{LP}}(E_1) = 1/2 \qquad P'_{\mathrm{LP}}(E_2) = 5/12 \qquad P'_{\mathrm{LP}}(E_3) = 1/12
\tag{4.20}
\]

Compare these results to the results of Jeffrey conditioning:

\[
P'_{\mathrm{JC}}(E_1) = 1/2 \qquad P'_{\mathrm{JC}}(E_2) = 3/8 \qquad P'_{\mathrm{JC}}(E_3) = 1/8
\tag{4.21}
\]

Note that (4.16), (4.21), and (4.20) correspond to $A, B, C$ in (4.12).

4.3.3 Triangulating LP and Jeffrey Conditioning

There is an interesting connection between LP conditioning and Jeffrey conditioning as updating methods. Let $B$ be on the zero-sum line between $A$ and $C$ if and only if

\[
d(A, C) = d(A, B) + d(B, C)
\tag{4.22}
\]

where $d$ is the difference measure I am using, so $d(A, B) = \|B - A\|$ for the geometry of reason and $d(A, B) = D_{\mathrm{KL}}(B, A)$ for information geometry. For the geometry of reason (and Euclidean geometry), the zero-sum line between two probability distributions is just what we intuitively think of as a straight line: in Cartesian coordinates, $B$ is on the zero-sum line strictly between $A$ and $C$ if and only if for some $\vartheta \in (0, 1)$, $b_i = \vartheta a_i + (1 - \vartheta) c_i$ for all $i = 1, \ldots, n$.

What the zero-sum line looks like for information theory is illustrated in figure 4.5. The reason for the oddity is that the Kullback-Leibler divergence does not obey triangularity, an issue that I will address in detail in subsection 4.5.1. Call $B$ a zero-sum point between $A$ and $C$ if (4.22) holds true. For the geometry of reason, the zero-sum points are simply the points on the straight line between $A$ and $C$. For information geometry, the zero-sum points are the boundary points of the set where you can take a shortcut by making a detour, i.e. all points for which $d(A, B) + d(B, C) < d(A, C)$.

Figure 4.5: The zero-sum line between $A$ and $C$ is the boundary line between the green area, where the triangle inequality holds, and the red area, where the triangle inequality is violated. The posterior probability distribution $B$ recommended by Jeffrey conditioning always lies on the zero-sum line between the prior $A$ and the LP posterior $C$, as per equation (4.23). $E$ is the point in the red area where the triangle inequality is most efficiently violated.

Remarkably, if $A$ represents a relatively prior probability distribution and $C$ the posterior probability distribution recommended by LP conditioning, the posterior probability distribution recommended by Jeffrey conditioning is always a zero-sum point with respect to the Kullback-Leibler divergence:

\[
D_{\mathrm{KL}}(C, A) = D_{\mathrm{KL}}(B, A) + D_{\mathrm{KL}}(C, B)
\tag{4.23}
\]

Informationally speaking, if you go from $A$ to $C$, you can just as well go from $A$ to $B$ and then from $B$ to $C$. This does not mean that we can conceive of information geometry the way we would conceive of non-Euclidean geometry, where it is also possible to travel faster on what from a Euclidean perspective looks like a detour. For in information geometry, you can travel faster on what from the perspective of information theory (!) looks like a detour, i.e. the triangle inequality does not hold.

To prove equation (4.23) in the case $n = 3$ (assuming that LP conditioning does not 'fall off the edge' as in case (b) in Leitgeb and Pettigrew, 2010b, 253) note that all three points (prior, point recommended by Jeffrey conditioning, point recommended by LP conditioning) can be expressed using three variables:

\[
\begin{aligned}
A &= (1 - \alpha,\; \beta,\; \alpha - \beta) \\
B &= \left(1 - \gamma,\; \frac{\gamma\beta}{\alpha},\; \frac{\gamma(\alpha - \beta)}{\alpha}\right) \\
C &= \left(1 - \gamma,\; \beta + \tfrac{1}{2}(\gamma - \alpha),\; \alpha - \beta + \tfrac{1}{2}(\gamma - \alpha)\right)
\end{aligned}
\tag{4.24}
\]

The rest is basic algebra using the definition of the Kullback-Leibler divergence in (3.18). To prove the claim for arbitrary $n$ one simply generalizes (4.24). It is a handy corollary of (4.23) that whenever $(A, B)$ and $(B, C)$ violate transitivity of asymmetry then

\[
D_{\mathrm{KL}}(A, C) > D_{\mathrm{KL}}(B, C) + D_{\mathrm{KL}}(A, B)
\tag{4.25}
\]

in violation of triangularity. This way I will not have to go hunting for an example to demonstrate the violation of triangularity. $A, B, C$ of (4.12) fulfill all the conditions for (4.25) and therefore violate triangularity.

It is an interesting question which point $E$ violates the triangle inequality most efficiently, so that the length of the detour from $A$ via $E$ to $C$,

\[
D_{\mathrm{KL}}(E, A) + D_{\mathrm{KL}}(C, E),
\tag{4.26}
\]

is minimal. Let $e = (e_1, \ldots, e_n)$ represent $E$ in $S^{n-1}$. Use the Lagrange Multiplier method and form the Lagrangian

\[
L(e, \lambda) = \sum_{i=1}^{n} e_i \log\frac{e_i}{a_i} + \sum_{i=1}^{n} c_i \log\frac{c_i}{e_i} + \lambda\left(\sum_{i=1}^{n} e_i - 1\right)
\tag{4.27}
\]

The Lagrange Multiplier method gives us

\[
\frac{\partial L}{\partial e_k} = \log\frac{e_k}{a_k} + 1 - \frac{c_k}{e_k} + \lambda = 0 \quad \text{for each } k = 1, \ldots, n.
\tag{4.28}
\]

Manipulate this equation to yield

\[
\frac{c_k}{e_k}\exp\left(\frac{c_k}{e_k}\right) = \frac{c_k}{a_k}\exp(1 + \lambda).
\tag{4.29}
\]

To solve (4.29), use the Lambert W function

\[
e_k = \frac{c_k}{W\left(\frac{c_k}{a_k}\exp(1 + \lambda)\right)}.
\tag{4.30}
\]

Choose $\lambda$ to fulfill the constraint $\sum e_i = 1$. The result for the discrete case accords with Ovidiu Calin and Constantin Udriste's result for the continuous case (see equation 4.7.9 in Calin and Udriste, 2014, 127). Numerically, for $A$ and $C$ as defined in equation (4.12),

\[
E = (0.415, 0.462, 0.123).
\tag{4.31}
\]

This is subtly different from the midpoint $m_i = 0.5a_i + 0.5c_i$ (if we were minimizing $D_{\mathrm{KL}}(A, E) + D_{\mathrm{KL}}(C, E)$, the solution would be the midpoint). I do not know whether $A, E, C$ are collinear (see figure 4.5 for illustration).

4.4 Expectations for the Geometry of Reason

This section provides more detail for the expectations in List A (see page 74) and shows how LP conditioning violates them.

4.4.1 Continuity

LP conditioning violates continuity because standard conditioning gives a different recommendation than a parallel sequence of Jeffrey-type updating scenarios which get arbitrarily close to standard event observation. This is especially troubling considering how important the case for standard conditioning is to Leitgeb and Pettigrew.

To illustrate a continuity violation, consider the case where Sherlock Holmes reduces his credence that the culprit was male to $\varepsilon_n = 1/n$ for $n = 4, 5, \ldots$.
The sequence $\varepsilon_n$ is not meant to reflect a case where Sherlock Holmes becomes successively more certain that the culprit was female. It is meant to reflect countably many parallel scenarios which only differ by the degree to which Sherlock Holmes is sure that the culprit was female. These parallel scenarios give rise to a parallel sequence (as opposed to a successive sequence) of updated probabilities $P'_{\mathrm{LP}}(F^{**})$ and another sequence of updated probabilities $P'_{\mathrm{JC}}(F^{**})$ ($F^{**}$ is the proposition that the culprit is female). As $n \to \infty$, both of these sequences go to one.

Straightforward conditionalization on the evidence that 'the culprit is female' gives us

\[
P'_{\mathrm{SC}}(E_1) = 0 \qquad P'_{\mathrm{SC}}(E_2) = 3/4 \qquad P'_{\mathrm{SC}}(E_3) = 1/4.
\tag{4.32}
\]

Letting $n \to \infty$ for Jeffrey conditioning yields

\[
\begin{aligned}
P'_{\mathrm{JC}}(E_1) &= 1/n \to 0 \\
P'_{\mathrm{JC}}(E_2) &= 3(n - 1)/4n \to 3/4 \\
P'_{\mathrm{JC}}(E_3) &= (n - 1)/4n \to 1/4,
\end{aligned}
\tag{4.33}
\]

whereas letting $n \to \infty$ for LP conditioning yields

\[
\begin{aligned}
P'_{\mathrm{LP}}(E_1) &= 1/n \to 0 \\
P'_{\mathrm{LP}}(E_2) &= (4n - 3)/6n \to 2/3 \\
P'_{\mathrm{LP}}(E_3) &= (2n - 3)/6n \to 1/3.
\end{aligned}
\tag{4.34}
\]

LP conditioning violates continuity.

4.4.2 Regularity

LP conditioning violates regularity because formerly positive probabilities can be reduced to 0 even though the new information in the Jeffrey-type updating scenario makes no such requirements (as is usually the case for standard conditioning). Ironically, Jeffrey-type updating scenarios are meant to be a better reflection of real-life updating because they avoid extreme probabilities.

The violation becomes serious if we are already sympathetic to an information-based account: the amount of information required to turn a non-extreme probability into one that is extreme (0 or 1) is infinite. Whereas the geometry of reason considers extreme probabilities to be easily accessible by non-extreme probabilities under new information (much like a marble rolling off a table or a bowling ball heading for the gutter), information theory envisions extreme probabilities more like an event horizon.
The nearer you are to the extreme probabilities, the more information you need to move on. For an observer, the horizon is never reached.

Example 13: Regularity Holmes. Everything is as in example 11, except that Sherlock Holmes becomes confident to a degree of 2/3 that Mr. R is the culprit and updates his relatively prior probability distribution in (4.16).

Then his posterior probabilities look as follows:

\[
P'_{\mathrm{JC}}(E_1) = 2/3 \qquad P'_{\mathrm{JC}}(E_2) = 1/4 \qquad P'_{\mathrm{JC}}(E_3) = 1/12
\tag{4.35}
\]

\[
P'_{\mathrm{LP}}(E_1) = 2/3 \qquad P'_{\mathrm{LP}}(E_2) = 1/3 \qquad P'_{\mathrm{LP}}(E_3) = 0
\tag{4.36}
\]

With LP conditioning, Sherlock Holmes' subjective probability that Ms. T is the culprit in example 13 has been reduced to zero. No finite amount of information could bring Ms. T back into consideration as a culprit in this crime, and Sherlock Holmes should be willing to bet any amount of money against a penny that she is not the culprit, even though his evidence is nothing more than an increase in the probability that Mr. R is the culprit.

LP conditioning violates regularity.

4.4.3 Levinstein

LP conditioning violates levinstein because of "the potentially dramatic effect [LP conditioning] can have on the likelihood ratios between different propositions" (Levinstein, 2012, 419.). Consider Benjamin Levinstein's example:

Example 14: Levinstein's Ghost. There is a car behind an opaque door, which you are almost sure is blue but which you know might be red. You are almost certain of materialism, but you admit that there's some minute possibility that ghosts exist. Now the opaque door is opened, and the lighting is fairly good. You are quite surprised at your sensory input: your new credence that the car is red is very high.

Jeffrey conditioning leads to no change in opinion about ghosts. Under LP conditioning, however, seeing the car raises the probability that there are ghosts to an astonishing 47%, given Levinstein's reasonable priors. Levinstein proposes a logarithmic inaccuracy measure as a remedy to avoid violation of levinstein.
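
The shift in ratios that Levinstein points to can already be seen in the Holmes numbers. The following is a minimal sketch, not Leitgeb and Pettigrew's general procedure: `jeffrey` and `lp_simple` are illustrative names, `lp_simple` only implements the two-step recipe of (4.18) and (4.19), and it assumes that no probability is pushed below zero (the boundary case is not handled). Levinstein's own priors are not reproduced here, so the sketch uses the prior of (4.16) instead.

```python
from fractions import Fraction as F

def jeffrey(prior, cells, targets):
    """Jeffrey conditioning: rescale every cell of the partition to its new mass."""
    post = list(prior)
    for cell, target in zip(cells, targets):
        mass = sum(prior[i] for i in cell)
        for i in cell:
            post[i] = prior[i] * target / mass
    return post

def lp_simple(prior, cells, targets):
    """Recipe of (4.18)-(4.19): add one constant to all members of a cell.

    Assumption: the result stays inside the simplex; otherwise we refuse
    to answer rather than guess at the boundary treatment.
    """
    post = list(prior)
    for cell, target in zip(cells, targets):
        shift = (target - sum(prior[i] for i in cell)) / len(cell)
        for i in cell:
            post[i] = prior[i] + shift
    if min(post) < 0:
        raise ValueError("boundary case not covered by this sketch")
    return post

prior = [F(1, 3), F(1, 2), F(1, 6)]      # (4.16)
cells = [[0], [1, 2]]                    # F* = E1, F** = E2 or E3

# Example 12: P'(F*) = 1/2; example 13: P'(F*) = 2/3.
jc12 = jeffrey(prior, cells, [F(1, 2), F(1, 2)])
lp12 = lp_simple(prior, cells, [F(1, 2), F(1, 2)])
jc13 = jeffrey(prior, cells, [F(2, 3), F(1, 3)])
lp13 = lp_simple(prior, cells, [F(2, 3), F(1, 3)])

# Ratio P(E2) : P(E3) before and after updating in example 12.
print("prior ratio  ", prior[1] / prior[2])
print("Jeffrey ratio", jc12[1] / jc12[2])
print("LP ratio     ", lp12[1] / lp12[2])
```

Jeffrey conditioning leaves the ratio between $E_2$ and $E_3$ untouched although the evidence says nothing about it, whereas the additive shift of LP conditioning changes it; in example 13 the shift is large enough to drive $P'_{\mathrm{LP}}(E_3)$ to zero.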
As a special case of applying a Levinstein-type logarithmic inaccuracy measure, information theory does not violate levinstein.

4.4.4 Invariance

LP conditioning violates invariance because two agents who have identical credences with respect to a partition of the event space may disagree about this partition after LP conditioning, even when the Jeffrey-type updating scenario provides no new information about the more finely grained partitions on which the two agents disagree.

Example 15: Jane Marple. Jane Marple is on the same case as Sherlock Holmes in example 11 and arrives at the same relatively prior probability distribution as Sherlock Holmes (I will call Jane Marple's relatively prior probability distribution $Q$ and her posterior probability distribution $Q'$). Jane Marple, however, has a more finely grained probability assignment than Sherlock Holmes and distinguishes between the case where Ms. S went to boarding school with her, of which she has a vague memory, and the case where Ms. S did not and the vague memory is only about a fleeting resemblance of Ms. S with another boarding school mate. Whether or not Ms. S went to boarding school with Jane Marple is completely beside the point with respect to the crime, and Jane Marple considers the possibilities equiprobable whether or not Ms. S went to boarding school with her.

Let $E_2 \equiv E_2^{*} \vee E_2^{**}$, where $E_2^{*}$ is the proposition that Ms. S is the culprit and she went to boarding school with Jane Marple and $E_2^{**}$ is the proposition that Ms. S is the culprit and she did not go to boarding school with Jane Marple. Then

\[
Q(E_1) = 1/3 \qquad Q(E_2^{*}) = 1/4 \qquad Q(E_2^{**}) = 1/4 \qquad Q(E_3) = 1/6.
\tag{4.37}
\]

Now note that while Sherlock Holmes and Jane Marple agree on the relevant facts of the criminal case (who is the culprit?) in their posterior probabilities if they use Jeffrey conditioning,

\[
P'_{\mathrm{JC}}(E_1) = 1/2 \qquad P'_{\mathrm{JC}}(E_2) = 3/8 \qquad P'_{\mathrm{JC}}(E_3) = 1/8
\tag{4.38}
\]

\[
Q'_{\mathrm{JC}}(E_1) = 1/2 \qquad Q'_{\mathrm{JC}}(E_2^{*}) = 3/16 \qquad Q'_{\mathrm{JC}}(E_2^{**}) = 3/16 \qquad Q'_{\mathrm{JC}}(E_3) = 1/8
\tag{4.39}
\]

they do not agree if they use LP conditioning,

\[
P'_{\mathrm{LP}}(E_1) = 1/2 \qquad P'_{\mathrm{LP}}(E_2) = 5/12 \qquad P'_{\mathrm{LP}}(E_3) = 1/12
\tag{4.40}
\]

\[
Q'_{\mathrm{LP}}(E_1) = 1/2 \qquad Q'_{\mathrm{LP}}(E_2^{*}) = 7/36 \qquad Q'_{\mathrm{LP}}(E_2^{**}) = 7/36 \qquad Q'_{\mathrm{LP}}(E_3) = 1/9.
\tag{4.41}
\]

LP conditioning violates invariance.

4.4.5 Expansibility

One particular problem with the lack of invariance for LP conditioning is how zero-probability events should be included in the list of prior probabilities that determines the value of the posterior probabilities. Consider

\[
P(X_1) = 0 \qquad P(X_2) = 0.3 \qquad P(X_3) = 0.6 \qquad P(X_4) = 0.1
\tag{4.42}
\]

That $P(X_1) = 0$ may be a consequence of standard conditioning in a previous step. Now the agent learns that $P'(X_3 \vee X_4) = 0.5$. Should the agent update on the list presented in (4.42) or on the following list:

\[
P(X_2) = 0.3 \qquad P(X_3) = 0.6 \qquad P(X_4) = 0.1
\tag{4.43}
\]

Whether you update on (4.42) or (4.43) makes no difference to Jeffrey conditioning, but due to the lack of invariance it makes a difference to LP conditioning, so the geometry of reason needs to find a principled way to specify the appropriate prior probabilities. The only non-arbitrary way to do this is either to include or to exclude all zero probability events on the list. This strategy, however, sounds ill-advised unless one signs on to a stronger version of regularity and requires that only a fixed set of events can have zero probabilities (such as logical contradictions), but then the geometry of reason ends up in the catch-22 of LP conditioning running afoul of regularity.

LP conditioning violates expansibility.

4.4.6 Horizon

Example 16: Undergraduate Complaint.
An undergraduate student complains to the department head that the professor will not reconsider an 89% grade (which misses an A+ by one percent) when reconsideration was given to other students with a 67% grade (which misses a B- by one percent).

Intuitions may diverge, but the professor's reasoning is as follows. To improve a 60% paper by ten percent is easily accomplished: having your roommate check your grammar, your spelling, and your line of argument will sometimes do the trick. It is incomparably more difficult to improve an 85% paper by ten percent: it may take doing a PhD to turn a student who writes the former into a student who writes the latter. A maiore ad minus, the step from 89% to 90% is greater than the step from 67% to 68%.

Another example for the horizon effect is George Schlesinger's comparison between the risk of a commercial airplane crash and the risk of a military glider landing in enemy territory.

Example 17: Airplane Gliders. Compare two scenarios. In the first, an airplane which is considered safe (probability of crashing is $1/10^9$) goes through an inspection where a mechanical problem is found which increases the probability of a crash to 1/100. In the second, military gliders land behind enemy lines, where their risk of perishing is 26%. A slight change in weather pattern increases this risk to 27%. (Schlesinger, 1995, 211.)
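
To make the contrast in example 17 concrete, the two scenarios can be treated as two-cell partitions (crash or no crash, perish or survive) and compared under both notions of difference. This is only an illustrative sketch: the function names are hypothetical, the divergence is taken in the order $D_{\mathrm{KL}}(\text{posterior}, \text{prior})$ with natural logarithms as elsewhere in this chapter, and the Euclidean distance is measured between the two points on the one-dimensional simplex.

```python
import math

def kl_two_cells(post, prior):
    """D_KL(posterior, prior) for the partition {event, not event}."""
    total = 0.0
    for q, p in ((post, prior), (1.0 - post, 1.0 - prior)):
        if q > 0.0:                      # convention: 0 * log 0 = 0
            total += q * math.log(q / p)
    return total

def euclid_two_cells(post, prior):
    """Euclidean distance between (prior, 1 - prior) and (post, 1 - post)."""
    return math.sqrt(2.0) * abs(post - prior)

scenarios = {
    "airplane": (1e-9, 1e-2),   # prior and posterior risk of a crash
    "glider": (0.26, 0.27),     # prior and posterior risk of perishing
}

for name, (prior, post) in scenarios.items():
    print(name,
          "Euclidean:", euclid_two_cells(post, prior),
          "KL:", kl_two_cells(post, prior))
```

Under these assumptions the Euclidean distance treats the two updates as practically the same step of about one percentage point, while the divergence for the airplane case is larger than the one for the glider case by more than two orders of magnitude, which is the asymmetry Schlesinger's example trades on.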
For an interesting instance of the horizon effect in asymmetric multi-dimensional scaling see Chino and Shiraiwa, 1993, section 3, where Naohito Chino and Kenichi Shiraiwa describe as one of the properties of their Hilbert space models of asymmetry how "the similarity between the pair of objects located far from the centroid of objects, say, the origin, is greater than that located near the origin, even if their distances are the same" (42).

I claim that an amujus ought to fulfill the requirements of the horizon effect: it ought to be more difficult to update as probabilities become more extreme (or less middling). I have formalized this requirement in List B (see page 75). It is trivial that the geometry of reason does not fulfill it. Information theory fails as well, which gives the horizon effect its prominent place in both lists. The way information theory fails, however, is quite different. Near the boundary of $S^{n-1}$, information theory reflects the horizon effect just as our expectation requires. The problem is near the centre, where some equidistant points are more divergent the closer they are to the middle. I will give an example and more explanation in subsection 4.5.2.

In the next section, I will closely tie the issue of the horizon effect to confirmation. The two main candidates for quantitative measures of relevance confirmation disagree on precisely this issue. Whether you, the reader, will accept the horizon requirement may depend on what your view is on degree of confirmation theory.

4.4.7 Confirmation

The geometry of reason thinks about the comparison of probability distributions in terms of distance. Information theory thinks about the comparison along the lines of information loss when one distribution is used to encode a message rather than the other distribution. One way to test these approaches is to ask how well they align with a third approach to such a comparison: degree of confirmation.
My main concern is the horizon effect of the previous subsection. Which approaches to degree of confirmation theory reflect it, and how do these approaches correspond to the disagreements between information theory and the geometry of reason?

There is, of course, a relevant difference between the aims of the epistemic utility approach to updating and the aims of degree of confirmation theory. The former investigates norms to which a rational agent conforms in her pursuit of epistemic utility. The latter seeks to establish qualitative and quantitative measures of impact that evidence has on a hypothesis. Both, however, (I will restrict my attention here to quantitative degree of confirmation theory) attend to the probability of an event, which degree of confirmation theory customarily calls $h$ for hypothesis, before and after the rational agent processes another event, customarily called $e$ for evidence, i.e. $x = P(h|k)$ and $y = P(h|e, k)$ ($k$ is background information).

For perspectives on the link between confirmation and information see Shogenji, 2012, 37f; Crupi and Tentori, 2014; and Milne, 2014, section 4. Vincenzo Crupi and Katya Tentori suggest that there is a "parallelism between confirmation and information search [which] can serve as a valuable heuristic for theoretical work" (Crupi and Tentori, 2014, 89.).

In degree of confirmation theory, incremental confirmation is distinguished from absolute confirmation in the following sense. Let $h$ be the presence of a very rare disease and $e$ a test result such that $y \gg x$ but $y < 1 - y$. Then, absolutely speaking, $e$ disconfirms $h$ (for Rudolf Carnap, absolute confirmation involves sending $y$ above a threshold $r$ which must be greater than or equal to 0.5). Absolute confirmation is not the subject of this section.
I will exclusively discuss incremental confirmation (also called relevance confirmation, just as absolute confirmation is sometimes called firmness confirmation) where $y > x$ implies (incremental) confirmation, $y < x$ implies (incremental) disconfirmation, and $y = x$ implies the lack of both. The difference is illustrated in figure 4.6.

All proposed measures of quantitative, incremental degree of confirmation considered here are a function of $x$ and $y$. Dependence of incremental confirmation on only $x$ and $y$ is not trivial, as $P(e|k)$ and $P(e|h, k)$ cannot be expressed using only $x$ and $y$ (for a case why dependence should be on only $x$ and $y$ see Atkinson, 2012, 50, with an irrelevant conjunction argument; and Milne, 2014, 254, with a continuity argument). David Christensen's measure $P(h|e, k) - P(h|\neg e, k)$ (see Christensen, 1999, 449) and Robert Nozick's $P(e|h, k) - P(e|\neg h, k)$ (see Nozick, 1981, 252) are not only dependent on $x$ and $y$, but also on $P(e|k)$, which makes them vulnerable to Atkinson's and Milne's worries just cited.

Consider the following six contenders for a quantitative, incremental degree of confirmation function, dependent on only $x$ and $y$. They are based on, in a brief slogan, (i) difference of conditional probabilities, (ii) ratio of conditional probabilities, (iii) difference of odds, (iv) ratio of likelihoods, (v) Gaifman's treatment of Hempel's raven paradox, and (vi) conservation of contrapositivity and commutativity. Logarithms throughout this dissertation are assumed to be the natural logarithm in order to facilitate easy differentiation, although generally a particular choice of base (greater than one) does not make a relevant difference.

\[
\begin{aligned}
\text{(i)}\quad & M_P(x, y) = y - x \\
\text{(ii)}\quad & R_P(x, y) = \log\frac{y}{x} \\
\text{(iii)}\quad & J_P(x, y) = \frac{y}{1 - y} - \frac{x}{1 - x} \\
\text{(iv)}\quad & L_P(x, y) = \log\frac{y(1 - x)}{x(1 - y)} \\
\text{(v)}\quad & G_P(x, y) = \log\frac{1 - x}{1 - y} \\
\text{(vi)}\quad & Z_P(x, y) = \begin{cases} \dfrac{y - x}{1 - x} & \text{if } y \geq x \\[1ex] \dfrac{y - x}{x} & \text{if } y < x \end{cases}
\end{aligned}
\tag{4.44}
\]

$M_P$ is defended by Carnap, 1962; Earman, 1992; Rosenkrantz, 1994.
$R_P$ is defended by Keynes, 1921; Milne, 1996; Shogenji, 2012. $J_P$ is defended by Festa, 1999. $L_P$ is defended by Good, 1950; Good, 1983, chapter 14; Fitelson, 2006; Zalabardo, 2009. $G_P$ is defended by Gaifman, 1979, 120, without the logarithm (I added it to make $G_P$ more comparable to the other functions). $Z_P$ is defended by Crupi et al., 2007. For more literature supporting the various measures consult footnote 1 in Fitelson, 2001, S124; and an older survey of options in Kyburg, 1983.

To compare how these degree of confirmation measures align with the concept of difference between probability distributions for the purpose of updating it is best to look at derivatives as they reflect the rate of change from the middle to the extremes. This is how we capture the horizon effect requirement for two dimensions. One important difference between degree of confirmation theory and updating is that the former is concerned with a hypothesis and its negation whereas the latter considers all sorts of domains for the probability distribution (in this chapter, I have restricted myself to a finite outcome space). As far as the analogy between degree of confirmation theory on the one hand and updating on the other hand is concerned, I only need to look at the two-dimensional case.

To discriminate between candidates (i)–(vi), I am setting up three criteria (complementing many others in the literature). Let $D(x, y)$ be the generic expression for the degree of confirmation function. Call this List C.

• additivity: A theory can be confirmed piecemeal. Whether the evidence is split up into two or more components or left in one piece is irrelevant to the amount of confirmation it confers. Formally, $D(x, z) = D(x, y) + D(y, z)$. Note that this is not the usual triangle inequality because we are in two dimensions.
• skew-antisymmetry: It does not matter whether $h$ or $\neg h$ is in view. Confirmation and disconfirmation are commensurable. Formally, $D(x, y) = -D(1 - x, 1 - y)$.
A surprising number of candidates fail this requirement, and the requirement is not common in the literature (see, however, the second clause in Milne's fourth desideratum in 1996, 21). In defence of this requirement consider example 18 below. $d_1 > d_2$ may have a negative impact on the latter scientist's grant application, even though the inequality may solely be due to a failure to fulfill skew-antisymmetry.
• confirmation horizon: An account of degree of confirmation must exhibit the horizon effect as in List A and List B, except more simply in two dimensions. Formally, the functions $\partial D^{+}_{\varepsilon}/\partial x$ must be strictly positive and the functions $\partial D^{-}_{\varepsilon}/\partial x$ must be strictly negative for all $\varepsilon \in (-1/2, 1/2)$. These functions are defined in (4.45) and (4.46), and I prove in appendix C that the requirement to keep them strictly positive/negative is equivalent to the horizon effect as described formally in List B (see page 75).

Example 18: Grant Adjudication I. Two scientists compete for grant money. Professor X presents an experiment conferring degree of confirmation $d_1$ on a hypothesis, if successful; Professor Y presents an experiment conferring degree of disconfirmation $-d_2$ on the negation of the same hypothesis, if unsuccessful. (For the relevance of quantitative confirmation measures to the evaluation of scientific projects see Salmon, 1975, 11.)

The functions for the horizon effect are defined as follows. Let $\varepsilon \in (-1/2, 1/2)$ be fixed. Recall that $D(x, y)$ is the generic expression for a confirmation function measuring the degree of confirmation that a posterior $y = P(h|e, k)$ bestows on a hypothesis for which the prior is $x = P(h|k)$. $\varepsilon$ is the difference $y - x$.
For $\varepsilon > 0$,

\[
\begin{aligned}
D^{-}_{\varepsilon} &: \left(0, \tfrac{1}{2} - \varepsilon\right) \to \mathbb{R} & D^{-}_{\varepsilon}(x) &= |D(x, x + \varepsilon)| \\
D^{+}_{\varepsilon} &: \left(\tfrac{1}{2}, 1 - \varepsilon\right) \to \mathbb{R} & D^{+}_{\varepsilon}(x) &= |D(x, x + \varepsilon)|
\end{aligned}
\tag{4.45}
\]

For $\varepsilon < 0$,

\[
\begin{aligned}
D^{-}_{\varepsilon} &: \left(-\varepsilon, \tfrac{1}{2}\right) \to \mathbb{R} & D^{-}_{\varepsilon}(x) &= |D(x, x + \varepsilon)| \\
D^{+}_{\varepsilon} &: \left(\tfrac{1}{2} - \varepsilon, 1\right) \to \mathbb{R} & D^{+}_{\varepsilon}(x) &= |D(x, x + \varepsilon)|
\end{aligned}
\tag{4.46}
\]

The rate of change for the different quantitative measures of degree of confirmation can be observed in figure 4.6. The pass and fail verdicts in the table below are evident from figure 4.6 and the table of derivatives provided in appendix C. Only $J_P$, $L_P$ and $Z_P$ fulfill the horizon requirement.

Figure 4.6: Illustration for the six degree of confirmation candidates plus Carnap's firmness confirmation and the Kullback-Leibler divergence. The top row, from left to right, illustrates FMRJ, the bottom row LGZI. 'F' stands for Carnap's firmness confirmation measure $F_P(x, y) = \log(y/(1 - y))$. 'M' stands for candidate (i), $M_P(x, y)$ in (4.44); the other letters correspond to the other candidates (ii)–(vi). 'I' stands for the Kullback-Leibler divergence multiplied by the sign function of $y - x$ to mimic a quantitative measure of confirmation. For all the squares, the colour reflects the degree of confirmation with $x$ on the x-axis and $y$ on the y-axis, all between 0 and 1. The origin of the square's coordinate system is in the bottom left corner. Blue signifies strong confirmation, red signifies strong disconfirmation, and green signifies the scale between them. Perfect green is $x = y$. $G_P$ looks like it might pass the horizon requirement, but the derivative reveals that it fails confirmation horizon (see appendix C).

Candidate   Triangularity   Skew-Antisymmetry   Confirmation Horizon
M_P         pass            pass                fail
R_P         pass            fail                fail
J_P         pass            fail                pass
L_P         pass            pass                pass
G_P         pass            fail                fail
Z_P         fail            pass                pass

The table makes clear that only $L_P$ passes all three tests. I am not making a strong independent case for $L_P$ here, especially against $M_P$, which is the most likely hero of the geometry of reason.
This has been done elsewhere (for example inSchlesinger, 1995, where LP and MP are compared to each other in their performancegiven some intuitive examples; Elliott Sober presents the counterargument in Sober,1994). The argumentative force of this subsection appeals to those who are alreadysympathetic to LP . Adherents of MP will hopefully find other items on List A (seepage 74) persuasive and reject the geometry of reason, in which case they may comeback to this subsection and re-evaluate their commitment to MP .Example 19: Grant Adjudication II. Two scientists compete for grant money.Professor X presents an experiment that will increase the probability of a hypothesisfrom 98% to 99%, if successful. Professor Y presents an experiment that will increasethe probability of a hypothesis from 1% to 2%, if successful.All else being equal, Professor Y should receive the grant money. If her exper-iment is more successful, it will arguably make more of a difference. This exampleillustrates that the analogy between degree of confirmation and updating remains ten-uous, since for degree of confirmation theory the consensus on intuitions is far inferiorto the updating case. If, for example, the confirmation function is anti-symmetricand D(x, y) − D(y, x) is zero (for MP and LP , for example), then together withskew-antisymmetry this means that degree of confirmation is equal for Professor Xand Professor Y. Despite its three passes in the above table, LP fails here.Based on Roberto Festa’s JP , Professor X’s prospective degree of confirmationis 5000 times larger than Professor Y’s, but Festa in particular insists “that thereis no universally applicable P -incremental c-measure, and that the appropriatenessof a P -incremental c-measure is highly context-dependent” (Festa, 1999, 67.). RPand ZP appear to be sensitive to example 19. 
The Kullback-Leibler divergence gives us the right result as well, where the degree of confirmation for going from 1% to 2% is $3.91 \cdot 10^{-3}$ compared to $3.12 \cdot 10^{-3}$ for going from 98% to 99%, but the Kullback-Leibler divergence is not a serious degree of confirmation candidate. It fulfills skew-antisymmetry and confirmation horizon, but not additivity (see subsection 4.5.1).

Intuitions easily diverge here. Christensen may be correct when he says, “perhaps the controversy between difference and ratio-based positive relevance models of quantitative confirmation reflects a natural indeterminateness in the basic notion of ‘how much’ one thing supports another” (Christensen, 1999, 460). Pluralists allow therefore for “distinct, complementary notions of evidential support” (Hájek and Joyce, 2008, 123). I am sympathetic towards this indeterminateness in degree of confirmation theory, but not when it comes to updating (see the full employment theorem in section 7.1).

This subsection assumes that despite these problems with the strength of the analogy, degree of confirmation and updating are sufficiently similar to be helpful in associating options with each other and letting the arguments in each other’s favour and disfavour cross-pollinate. As an aside, Christensen’s S-support given by evidence $E$ is stable over Jeffrey conditioning on $[E, q_E]$; LP conditioning is not (see Christensen, 1999, 451).
This may serve as another argument from degree of confirmation theory in favour of information theory (which supports Jeffrey conditioning) against the geometry of reason (which supports LP conditioning).

4.5 Expectations for Information Theory

Asymmetry is the central feature of the difference concept that information theory proposes for the purpose of updating between finite probability distributions. In information theory, the information loss differs depending on whether one uses probability distribution $P$ to encode a message distributed according to probability distribution $Q$, or whether one uses probability distribution $Q$ to encode a message distributed according to probability distribution $P$. This asymmetry may very well carry over into the epistemic realm. Updating from one probability distribution, for example, which has $P(X) = x > 0$ to $P'(X) = 0$ is common. It is called standard conditioning. Going in the opposite direction, however, from $P(X) = 0$ to $P'(X) = x' > 0$ is controversial and unusual.

The Kullback-Leibler divergence, which is the most promising concept of difference for probability distributions in information theory and the one which gives us Bayesian standard conditioning as well as Jeffrey conditioning, is non-commutative and may provide the kind of asymmetry required to reflect epistemic asymmetry. However, it also violates triangularity, collinear horizon, and transitivity of asymmetry. The task of this section is to show how serious these violations are.

4.5.1 Triangularity

As mentioned at the end of subsection 4.3.3, the three points $A, B, C$ in (4.12) violate triangularity as in (4.25):

$$D_{\mathrm{KL}}(A, C) > D_{\mathrm{KL}}(B, C) + D_{\mathrm{KL}}(A, B). \tag{4.47}$$

This is counterintuitive on a number of levels, some of which I have already hinted at in illustration: taking a shortcut while making a detour; buying a pair of shoes for more money than buying the shoes individually.

Information theory, however, does not only violate triangularity.
It violates it in a particularly egregious way. Consider any two distinct points $x$ and $z$ on $S^{n-1}$ with coordinates $x_i$ and $z_i$ ($1 \le i \le n$). For simplicity, let us write $\delta(x, z) = D_{\mathrm{KL}}(z, x)$. Then, for any $\vartheta \in (0, 1)$ and an intermediate point $y$ with coordinates $y_i = \vartheta x_i + (1 - \vartheta) z_i$, the following inequality holds true:

$$\delta(x, z) > \delta(x, y) + \delta(y, z). \tag{4.48}$$

I will prove this in a moment, but here is a disturbing consequence: think about an ever more finely grained sequence of partitions $y_j$, $j \in \mathbb{N}$, of the line segment from $x$ to $z$ with $y_{jk}$ as dividing points. I will spare myself defining these partitions, but note that any dividing point $y_{j_0 k}$ will also be a dividing point in the more finely grained partitions $y_{jk}$ with $j \ge j_0$. Then define the sequence

$$T_j = \sum_k \delta\left(y_{jk}, y_{j(k+1)}\right) \tag{4.49}$$

such that the sum has as many summands as there are dividing points for $j$, plus one (for example, two dividing points divide the line segment into three possibly unequal parts). If $\delta$ were the Euclidean distance norm, $T_j$ would be constant and would equal $\|z - x\|$. Zeno’s arrow moves happily along from $x$ to $z$, no matter how many stops it makes on the way. Not so for information theory and the Kullback-Leibler divergence. According to (4.48), any stop along the way reduces the sum of divergences.

$T_j$ is a strictly decreasing sequence (does it go to zero? I do not know, but if it does, it would add to the poignancy of this violation). The more stops you make along the way, the closer you bring together $x$ and $z$.

For the proof of (4.48), it is straightforward to see that (4.48) is equivalent to

$$\sum_{i=1}^{n} (z_i - x_i) \log \frac{\vartheta x_i + (1 - \vartheta) z_i}{x_i} > 0. \tag{4.50}$$

Now I use the following trick. Expand the left-hand side to

$$\sum_{i=1}^{n} \left(z_i + \frac{\vartheta}{1 - \vartheta} x_i - \frac{\vartheta}{1 - \vartheta} x_i - x_i\right) \log \frac{\frac{1}{1 - \vartheta}\left(\vartheta x_i + (1 - \vartheta) z_i\right)}{\frac{1}{1 - \vartheta} x_i} > 0. \tag{4.51}$$

(4.51) is clearly equivalent to (4.50). It is also equivalent to

$$\sum_{i=1}^{n} \left(z_i + \frac{\vartheta}{1 - \vartheta} x_i\right) \log \frac{z_i + \frac{\vartheta}{1 - \vartheta} x_i}{\frac{1}{1 - \vartheta} x_i} + \sum_{i=1}^{n} \frac{1}{1 - \vartheta} x_i \log \frac{\frac{1}{1 - \vartheta} x_i}{z_i + \frac{\vartheta}{1 - \vartheta} x_i} > 0, \tag{4.52}$$

which is true by Gibbs’ inequality.
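Inequality (4.48) and the behaviour of $T_j$ can be explored numerically. The following is a minimal sketch under stated assumptions: natural logarithms, two illustrative interior points of $S^2$ that do not appear in the text, and partitions into $2^j$ equal segments as one admissible choice of nested partitions. The helper names `kl`, `delta`, `mix` and `path_sum` are hypothetical.

```python
import math

def kl(posterior, prior):
    """D_KL(posterior, prior) in nats, for strictly positive priors."""
    return sum(q * math.log(q / p) for q, p in zip(posterior, prior) if q > 0)

def delta(x, z):
    """delta(x, z) = D_KL(z, x), the shorthand used for (4.48)."""
    return kl(z, x)

def mix(x, z, theta):
    """Point with coordinates theta * x_i + (1 - theta) * z_i."""
    return [theta * xi + (1 - theta) * zi for xi, zi in zip(x, z)]

def path_sum(x, z, segments):
    """T_j for a partition of the segment from x to z into equal parts."""
    points = [mix(x, z, 1 - k / segments) for k in range(segments + 1)]
    return sum(delta(points[k], points[k + 1]) for k in range(segments))

# Illustrative interior points of the simplex (assumed, not from the text).
x = [0.2, 0.3, 0.5]
z = [0.6, 0.3, 0.1]

# Check (4.48) for a few intermediate points y.
for theta in (0.1, 0.5, 0.9):
    y = mix(x, z, theta)
    assert delta(x, z) > delta(x, y) + delta(y, z)

# T_j for nested partitions into 1, 2, 4, ... equal segments.
sums = [path_sum(x, z, 2 ** j) for j in range(6)]
for j, t in enumerate(sums):
    print(j, round(t, 6))
```

For these particular points the sum shrinks each time the partition is refined, as (4.48) requires. The sketch does not settle what happens in the limit.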
4.5.2 Collinear Horizon

There are two intuitions at work that need to be balanced: on the one hand, the geometry of reason is characterized by simplicity, and the lack of curvature near extreme probabilities may be a price worth paying; on the other hand, simple examples such as example 17 make a persuasive case for curvature.

Information theory is characterized by a very complicated ‘semi-quasimetric’ (the attribute ‘quasi’ is due to its non-commutativity, the attribute ‘semi’ to its violation of the triangle inequality). One of its initial appeals is that it performs well with respect to the horizon requirement near the boundary of the simplex, which is also the location of Schlesinger’s examples. It is not trivial, however, to articulate what the horizon requirement really demands.

collinear horizon in List B seeks to set up the requirement as weakly as possible, only demanding that points collinear with the centre exhibit the horizon effect. The hope is that continuity will take care of the rest, since I want the horizon effect also for probability distributions that are not collinear with the centre. Be that as it may, the Kullback-Leibler divergence fails collinear horizon. Here is a simple example.

$$p = \left(\tfrac{1}{5}, \tfrac{2}{5}, \tfrac{2}{5}\right) \qquad p' = q = \left(\tfrac{1}{4}, \tfrac{3}{8}, \tfrac{3}{8}\right) \qquad q' = \left(\tfrac{3}{10}, \tfrac{7}{20}, \tfrac{7}{20}\right) \tag{4.53}$$

The conditions of collinear horizon in List B (see page 75) are fulfilled. If $p$ represents $A$, $p'$ and $q$ represent $B$, and $q'$ represents $C$, then note that $\|B - A\| = \|C - B\|$ and $M, A, B, C$ are collinear. In violation of collinear horizon,

$$D_{\mathrm{KL}}(B, A) = 7.3820 \cdot 10^{-3} > 6.4015 \cdot 10^{-3} = D_{\mathrm{KL}}(C, B). \tag{4.54}$$

This violation of an expectation is not as serious as the violation of triangularity or transitivity of asymmetry. Just as there is still a reasonable disagreement about difference measures (which do not exhibit the horizon effect) and ratio measures (which do) in degree of confirmation theory, most of us will not have strong intuitions about the adequacy of information theory based on its violation of collinear horizon. One way in which I can attenuate the independent appeal of this violation against information theory is by making it parasitic on the asymmetry of information theory.

Figure 4.7 illustrates what I mean. Consider the following two inequalities, where $M$ is represented by the centre $m$ of the simplex with $m_i = 1/n$ and $Y$ is an arbitrary probability distribution with $X$ as the midpoint between $M$ and $Y$, so $x_i = 0.5(m_i + y_i)$.

$$\text{(i)}\ D_{\mathrm{KL}}(Y, M) > D_{\mathrm{KL}}(M, Y) \quad \text{and} \quad \text{(ii)}\ D_{\mathrm{KL}}(X, M) > D_{\mathrm{KL}}(Y, X) \tag{4.55}$$

In terms of coordinates, the inequalities reduce to

$$\text{(i)}\ H(y) < \frac{1}{n} \sum_i (\log y_i) - \log \frac{1}{n^2} \quad \text{and} \tag{4.56}$$

$$\text{(ii)}\ H(y) > \log \frac{4}{n} - \sum_i \left[\left(\frac{3}{2} y_i + \frac{1}{2n}\right) \log \left(y_i + \frac{1}{n}\right)\right]. \tag{4.57}$$

(i) is simply the case described in the next subsection for asymmetry and illustrated on the bottom left of figure B.1. (ii) tells us how far from the midpoint we can go with a scenario where $p = m$, $p' = q$ while violating collinear horizon. Clearly, as illustrated in figure 4.7, there is a relationship between asymmetry and collinear horizon.

It is opaque what motivates information theory not only to put probability distributions farther apart near the periphery, as I would expect, but also near the centre. I lack the epistemic intuition reflected in the behaviour. The next subsection on asymmetry deals with this lack of epistemic intuition writ large.

Figure 4.7: These two diagrams illustrate inequalities (4.56) and (4.57). The former displays all points in red which violate collinear horizon, measured from the centre. The latter displays points in different colours whose orientation of asymmetry differs, measured from the centre. The two red sets are not the same, but there appears to be a relationship, one that ultimately I suspect to be due to the more basic property of asymmetry.

4.5.3 Transitivity of Asymmetry

Recall Joyce’s two axioms Weak Convexity and Symmetry (see page 73). The geometry of reason (certainly in its Euclidean form) mandates Weak Convexity because the bisector of an isosceles triangle is always shorter than the isosceles sides. Weak Convexity also holds for information theory (see appendix A for a proof). Symmetry, however, fails for information theory. Fortunately, although I do not pursue this any further here, information theory arrives at many of Joyce’s results even without the violated axiom.

Asymmetry presents a problem for the geometry of reason as well as for information theory. For the geometry of reason, the problem is akin to continuity. For information theory, the problem is the non-trivial nature of the asymmetries it induces, which somehow need to be reconnected to epistemic justification. I will consider this problem in a moment, but first I will have a look at the problem for the geometry of reason.

Extreme probabilities are special and create asymmetries in updating: moving in direction from certainty to uncertainty is asymmetrical to moving in direction from uncertainty to certainty. Geometry of reason’s metric topology, however, allows for no asymmetries.

Example 20: Extreme Asymmetry. Consider two cases where for case 1 the prior probabilities are $Y_1 = (0.4, 0.3, 0.3)$ and the posterior probabilities are $Y'_1 = (0, 0.5, 0.5)$; for case 2 the prior probabilities are reversed, so $Y_2 = (0, 0.5, 0.5)$ and the posterior probabilities $Y'_2 = (0.4, 0.3, 0.3)$.

Case 1 is a straightforward application of standard conditioning. Case 2 is more complicated: what does it take to raise a prior probability of zero to a positive number? In terms of information theory, the information required is infinite.
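The asymmetry between the two cases of example 20 shows up directly in the Kullback-Leibler divergence. The following is a minimal sketch, assuming natural logarithms, the argument order $D_{\mathrm{KL}}(\text{posterior}, \text{prior})$, and the usual conventions that $0 \log 0 = 0$ and that positive posterior mass on an atom with prior probability zero yields an infinite divergence. The names `kl` and `condition` are illustrative.

```python
import math

def kl(posterior, prior):
    """D_KL(posterior, prior) in nats, with the conventions stated above."""
    total = 0.0
    for q, p in zip(posterior, prior):
        if q == 0:
            continue            # 0 * log(0 / p) counts as 0
        if p == 0:
            return math.inf     # mass moved onto an atom the prior excludes
        total += q * math.log(q / p)
    return total

def condition(prior, event):
    """Standard conditioning on the atoms whose indices are in `event`."""
    mass = sum(prior[i] for i in event)
    return [p / mass if i in event else 0.0 for i, p in enumerate(prior)]

y1 = [0.4, 0.3, 0.3]
y1_post = condition(y1, {1, 2})   # case 1: the first atom is ruled out

case1 = kl(y1_post, y1)           # update from Y1 to Y1'
case2 = kl(y1, y1_post)           # update from Y2 = Y1' to Y2' = Y1
print(y1_post, case1, case2)
```

In case 1 the divergence equals $-\log 0.6$, the negative logarithm of the prior probability of the conditioning event; in case 2 it is infinite.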
Case 2 is also not compatible with standard conditioning (at least not with what Alan Hájek calls the ratio analysis of conditional probability, see Hájek, 2003). The geometry of reason may want to solve this problem by signing on to a version of regularity, but then it violates regularity. Happy kids, clean house, sanity: the hapless homemaker must pick two. The third remains elusive. Continuity, a consistent view of regularity, and symmetry: the hapless geometer of reason cannot have it all.

Now turn to the woes of the information theorist. Given the asymmetric similarity measure of probability distributions that information theory requires (the Kullback-Leibler divergence), a prior probability distribution $P$ may be closer to a posterior probability distribution $Q$ than $Q$ is to $P$ if their roles (prior-posterior) are reversed. That is just what we would expect. The problem is that there is another posterior probability distribution $R$ where the situation is just the opposite: prior $P$ is further away from posterior $R$ than prior $R$ is from posterior $P$. And whether a probability distribution different from $P$ is of the $Q$-type or of the $R$-type escapes any epistemic intuition.

For simplicity, let us consider probability distributions and their associated credence functions on an event space with three mutually exclusive atoms $\Omega = \{\omega_1, \omega_2, \omega_3\}$. The simplex $S^2$ represents all of these probability distributions. Every point $p$ in $S^2$ representing a probability distribution $P$ induces a partition on $S^2$ into points that are symmetric to $p$, positively skew-symmetric to $p$, and negatively skew-symmetric to $p$ given the topology of information theory.

In other words, if

$$\Delta_P(P') = D_{\mathrm{KL}}(P', P) - D_{\mathrm{KL}}(P, P'), \tag{4.58}$$

then, holding $P$ fixed, $S^2$ is partitioned into three regions,

$$\Delta_P^{-1}(\mathbb{R}_{>0}) \qquad \Delta_P^{-1}(\mathbb{R}_{<0}) \qquad \Delta_P^{-1}(\{0\}) \tag{4.59}$$

One could have a simple epistemic intuition such as ‘it takes less to update from a more uncertain probability distribution to a more certain probability distribution than the reverse direction,’ where the degree of certainty in a probability distribution is measured by its entropy. This simple intuition accords with what we said about extreme probabilities and it holds true for the asymmetric distance measure defined by the Kullback-Leibler divergence in the two-dimensional case where $\Omega$ has only two elements (see appendix B).

In higher-dimensional cases, however, the tripartite partition (4.59) is non-trivial: some probability distributions are of the $Q$-type, some are of the $R$-type, and it is difficult to think of an epistemic distinction between them that does not already presuppose information theory. See figure B.1 for graphical illustration of this point.

On any account of well-behaved and ill-behaved asymmetries, the Kullback-Leibler divergence is ill-behaved. Of the four axioms as listed by Ralph Kopperman for a distance measure $d$ (see Kopperman, 1988, 89), the Kullback-Leibler divergence violates both symmetry and triangularity, making it a ‘semi-quasimetric’:

(m1) $d(x, x) = 0$
(m2) $d(x, z) \le d(x, y) + d(y, z)$ (triangularity)
(m3) $d(x, y) = d(y, x)$ (symmetry)
(m4) $d(x, y) = 0$ implies $x = y$ (separation)

The Kullback-Leibler divergence not only violates symmetry and triangularity, but also transitivity of asymmetry. For a description of transitivity of asymmetry see List B on page 75.
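The asymmetry function (4.58) and the three regions of (4.59) can be sketched numerically. The example below applies them to the three distributions of (4.60), assuming natural logarithms. The names `kl`, `asymmetry` and `region` are hypothetical, and the labels only report the sign of $\Delta_P(P')$, which need not coincide with the sign convention of List B.

```python
import math

def kl(a, b):
    """D_KL(a, b) in nats for strictly positive b."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def asymmetry(p, p_prime):
    """Delta_P(P') = D_KL(P', P) - D_KL(P, P'), as in (4.58)."""
    return kl(p_prime, p) - kl(p, p_prime)

def region(p, p_prime, tol=1e-12):
    """Which of the three regions of (4.59) contains P', holding P fixed."""
    d = asymmetry(p, p_prime)
    if d > tol:
        return "positive"
    if d < -tol:
        return "negative"
    return "zero"

p1 = (1 / 2, 1 / 4, 1 / 4)
p2 = (1 / 3, 1 / 3, 1 / 3)
p3 = (2 / 5, 2 / 5, 1 / 5)

for name, (a, b) in {"(P1, P2)": (p1, p2),
                     "(P2, P3)": (p2, p3),
                     "(P1, P3)": (p1, p3)}.items():
    print(name, region(a, b), round(asymmetry(a, b), 6))
```

The first two pairs fall into the same region and the third into the opposite one, which is the failure of transitivity at issue.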
For an example of it, consider

$$P_1 = \left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right) \qquad P_2 = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) \qquad P_3 = \left(\tfrac{2}{5}, \tfrac{2}{5}, \tfrac{1}{5}\right) \tag{4.60}$$

In the terminology of transitivity of asymmetry in List B, $(P_1, P_2)$ is asymmetrically positive, and so is $(P_2, P_3)$. The reasonable expectation is that $(P_1, P_3)$ is asymmetrically positive by transitivity, but for the example in (4.60) it is asymmetrically negative.

How counterintuitive this is (epistemically and otherwise) is demonstrated by the fact that in MDS (the multi-dimensional scaling of distance relationships) almost all asymmetric distance relationships under consideration are asymmetrically transitive in this sense. For examples see international trade in Chino, 1978; journal citation in Coombs, 1964; car switch in Harshman et al., 1982; telephone calls in Harshman and Lundy, 1984; interaction or input-output flow in migration, economic activity, and social mobility in Coxon, 1982; flight time between two cities in Gentleman et al., 2006, 191; mutual intelligibility between Swedish and Danish in van Ommen et al., 2013, 193; Tobler’s wind model in Tobler, 1975; and the cyclist lovingly hand-sketched in Kopperman, 1988, 91.

This ‘ill behaviour’ of information theory begs for explanation, or at least classification (it would help, for example, to know that all reasonable non-commutative difference measures used for updating are ill-behaved). Kopperman’s objective is primarily to rescue continuity, uniform continuity, Cauchy sequences, and limits for topologies induced by difference measures which violate triangularity, symmetry, and/or separation. Kopperman does not touch axiom (m1), while in the psychological literature (see especially Tversky, 1977) self-similarity is an important topic. This is why an initially promising approach to asymmetric modeling in Hilbert spaces by Chino (see Chino, 1978; Chino, 1990; Chino and Shiraiwa, 1993; and Saburi and Chino, 2008) will not help us to distinguish well-behaved and ill-behaved asymmetries between probability distributions. I explain the reasons in appendix D.

The failure of Chino’s modeling approach to make useful distinctions among asymmetric distance measures between probability distributions leads us to the more complex theory of information geometry and differentiable manifolds. Both the results of Shun-ichi Amari (see Amari, 1985; and Amari and Nagaoka, 2000) and Nikolai Chentsov (see Chentsov, 1982) serve to highlight the special properties of the Kullback-Leibler divergence, not without elevating the discussion to a level of mathematical sophistication, however, where it is difficult to retain the appeal to epistemic intuitions. Information geometry considers families of probability distributions as differentiable manifolds equipped with a Riemannian metric. This metric, however, is Fisher’s information metric, not the Kullback-Leibler divergence, and it is defined on the tangent space of the simplex representing finite-dimensional probability distributions. There is a sense in which the Fisher information metric is the derivative of the Kullback-Leibler divergence (it is the second-order term of the divergence between neighbouring distributions), and so the connection to epistemic intuitions can be re-established.

For a future research project, it would be lovely either to see information theory debunked in favour of an alternative geometry (this chapter has demonstrated that this alternative will not be the geometry of reason); or to see uniqueness results for the Kullback-Leibler divergence to show that despite its ill behaviour the Kullback-Leibler divergence is the right asymmetric distance measure on which to base inference and updating.
Chentsov’s theory of monotone invariance and Amari’s theory of α-connections are potential candidates to provide such results as well as an epistemic justification for information theory.

Leitgeb and Pettigrew’s reasoning to establish LP conditioning on the basis of the geometry of reason is valid. Given the failure of LP conditioning with respect to expectations in List A, it cannot be sound. The premise to reject is the geometry of reason. A competing approach, information theory, yields results that fulfill all of these expectations except horizon. Information theory, however, fails two other expectations identified in List B, expectations which the geometry of reason fulfills.

I am left with loose ends and ample opportunity for further work. The epistemic utility approach, itself a relatively recent phenomenon, needs to come to a deeper understanding of its relationship with information theory. It is an open question, for example, if it is possible to provide a complete axiomatization consistent with information theory to justify probabilism, standard conditioning, and Jeffrey conditioning from an epistemic utility approach as Shore and Johnson have done from a pragmatic utility approach. It is also an open question, given the results of this chapter, if there is hope of reconciling information theory with intuitions we have about epistemic utility and its attendant quantitative concept of difference for partial beliefs.

Chapter 5
A Natural Generalization of Jeffrey Conditioning

5.1 Two Generalizations

Carl Wagner presents a case where Jeffrey conditioning does not apply but intuition dictates a solution in keeping with the same principle that motivates Jeffrey conditioning (see Wagner, 1992). I will sometimes call this intuition (W) for short, but usually I will call it Jeffrey’s Updating Principle (JUP). Wagner’s (W) solution, or Wagner conditioning, generalizes Jeffrey conditioning. PME, of course, also generalizes Jeffrey conditioning.
Wagner investigates whether his generalization (W) agrees with PME, finding that it does not. He then uses his method not only to present a “natural generalization of Jeffrey conditioning” (see Wagner, 1992, 250), but also to deepen criticism of PME.

I will show that PME not only generalizes Jeffrey conditioning (as is well known; for a formal proof see Caticha and Giffin, 2006) but also Wagner conditioning. Wagner’s intuition (W) is plausible, and his method works. His derivation of a disagreement with PME, however, is conceptually more complex than he assumes. Specifically, his derivation relies on an assumption that is rejected by advocates of PME: that rational agents can (indeed, should) have imprecise credences. In this chapter, I show that if agents have sharp credences, Wagner conditioning actually agrees with PME. In chapter 6, I examine the debate about imprecise credences.

Below, I will show that PME and (W) are consistent given (L). (L) is what I call the Laplacean principle which requires a rational agent, besides other standard Bayesian commitments, to hold sharp credences with respect to well-defined events under consideration. (I), which is inconsistent with (L) and which some Bayesians accept, allows a rational agent to have indeterminate or imprecise credences (see Ellsberg, 1961; Levi, 1985; Walley, 1991; and Joyce, 2010). The following table summarizes the positions in diagram form. ‘✓’ means that the positions with a mark (‘•’) are consistent with each other; ‘×’ means that they are inconsistent with each other.

PME   (W)   (I)   (L)
 •     •                ×   according to Wagner’s article
 •     •                ✓   according to this article
             •     •    ×   disagree over permitting indeterminacy
 •     •     •          ×   formally shown in Wagner’s article
 •     •           •    ✓   formally shown here

While Wagner is welcome to deny (L), my sense is that advocates of PME usually accept it because they are already on the moderate right of Sandy L. Zabell’s spectrum between left-wing dadaists and right-wing totalitarians (see Zabell, 2005, 27; Zabell’s representative of right-wing totalitarianism is E.T. Jaynes). If there were an advocate of PME sympathetic to (I), Wagner’s result would indeed force her to choose. Wagner’s criticism of PME is misplaced, however, since it rests on a hidden assumption that someone who believes in PME would tend not to hold. Wagner does not give an independent argument for (I). This chapter shows how elegantly PME generalizes not only standard conditioning and Jeffrey conditioning but also Wagner conditioning, once I accept (L). In chapter 6, I provide independent reasons why we should accept (L).

A tempered and differentiated account of PME (contrasted with Jaynes’ earlier right-wing version) is not only largely immune to criticisms, but often illuminates the problems that the criticisms pose (for example the Judy Benjamin case; see chapter 7). This account rests on principles such as (L) and a reasonable interpretation of what I mean by objectivity (see my comments about Donkin’s objectivism in section 2.1). To make this clearer, and before I launch into the formalities of generalizing Wagner conditioning by using PME, let us articulate (L). (L) is what I call the Laplacean principle and in addition to standard Bayesian commitments states that a rational agent assigns a determinate precise probability to a well-defined event under consideration (for a defence of (L) against (I) see White, 2010; and Elga, 2010). (I) is the negation of (L): rational agents sometimes do not assign determinate precise probabilities to a well-defined event under consideration.

To avoid excessive apriorism (see Seidenfeld, 1979), (L) does not require that a rational agent has probabilities assigned to all events in an event space, only that, once an event has been brought to attention, and sometimes retrospectively, the rational agent is able to assign a sharp probability (for more detail on this issue and some rules about which propositions need to be assigned subjective probabilities see Hájek, 2003, 313). Newton did not need to have a prior probability for Einstein’s theory in order to have a posterior probability for his theory of gravity.

(L) also does not require objectivity in the sense that all rational agents must agree in their probability distributions if they have the same information. It is important to distinguish between type I and type II prior probabilities (I have also been calling them absolutely and relatively prior probabilities). The former precede any information at all (so-called ignorance priors). The latter are simply prior relative to posterior probabilities in probability kinematics. They may themselves be posterior probabilities with respect to an earlier instance of probability kinematics.

The case for objectivity in probability kinematics, where prior probabilities are of type II, is consistent with and dependent on a subjectivist interpretation of probabilities, making for some terminological confusion. Interpretations of the evidence and how it is to be cast in terms of formal constraints may vary. Once we agree on a prior distribution (type II), however, and on a set of formal constraints representing our evidence, PME claims that posterior probabilities follow mechanically.
Just as is the case in deductive logic, we may come to a tentative and voluntary agreement on an interpretation, a set of rules and presuppositions and then go part of the way together. For the interplay between PME and Infomin, see section 3.2.

Some advocates of PME may find (L) too weak in its claims, but none think it is too strong. Once (L) is assumed, however, Wagner’s diagnosis of disagreement between (W) and PME fails. Moreover, PME and (L) together seamlessly generalize Wagner conditioning. In the remainder of this chapter I will provide a sketch of a formal proof for this claim. I make the case for (L) in chapter 6. A welcome side-effect of reinstating harmony between PME and (W) is that it provides an inverse procedure to Vladimír Majerník’s method of finding marginals based on given conditional probabilities (see Majerník, 2000; and section 5.4).

5.2 Wagner’s Natural Generalization of Jeffrey Conditioning

Wagner claims that he has found a relatively common case of probability kinematics in which PME delivers the wrong result so that we must develop an ad hoc generalization of Jeffrey conditioning. This is best explained by using Wagner’s example, the Linguist problem (see Wagner, 1992, 252 and Spohn, 2012, 197).

Example 21: Linguist. You encounter the native of a certain foreign country and wonder whether he is a Catholic northerner ($\theta_1$), a Catholic southerner ($\theta_2$), a Protestant northerner ($\theta_3$), or a Protestant southerner ($\theta_4$). Your prior probability $p$ over these possibilities (based, say, on population statistics and the judgment that it is reasonable to regard this individual as a random representative of his country) is given by $p(\theta_1) = 0.2$, $p(\theta_2) = 0.3$, $p(\theta_3) = 0.4$, and $p(\theta_4) = 0.1$. The individual now utters a phrase in his native tongue which, due to the aural similarity of the phrases in question, might be a traditional Catholic piety ($\omega_1$), an epithet uncomplimentary to Protestants ($\omega_2$), an innocuous southern regionalism ($\omega_3$), or a slang expression used throughout the country in question ($\omega_4$). After reflecting on the matter you assign subjective probabilities $u(\omega_1) = 0.4$, $u(\omega_2) = 0.3$, $u(\omega_3) = 0.2$, and $u(\omega_4) = 0.1$ to these alternatives. In the light of this new evidence how should you revise $p$?

Let $\Theta = \{\theta_i : i = 1, \ldots, 4\}$, $\Omega = \{\omega_i : i = 1, \ldots, 4\}$. Let $\Gamma : \Omega \to 2^\Theta - \{\emptyset\}$ be the function which maps $\omega$ to $\Gamma(\omega)$, the narrowest event in $\Theta$ entailed by the outcome $\omega \in \Omega$. Here are two definitions that take advantage of the apparatus established by Arthur Dempster (see Dempster, 1967). I will need $m$ and $b$ to articulate Wagner’s (W) solution for Linguist type problems.

$$\text{For all } E \subseteq \Theta, \quad m(E) = u(\{\omega \in \Omega : \Gamma(\omega) = E\}). \tag{5.1}$$

$$\text{For all } E \subseteq \Theta, \quad b(E) = \sum_{H \subseteq E} m(H) = u(\{\omega \in \Omega : \Gamma(\omega) \subseteq E\}). \tag{5.2}$$

Let $Q$ be the posterior joint probability measure on $\Theta \times \Omega$, and $Q_\Theta$ the marginalization of $Q$ to $\Theta$, $Q_\Omega$ the marginalization of $Q$ to $\Omega$. Wagner plausibly suggests that $Q$ is compatible with $u$ and $\Gamma$ if and only if

$$\text{for all } \theta \in \Theta \text{ and for all } \omega \in \Omega, \quad \theta \notin \Gamma(\omega) \text{ implies that } Q(\theta, \omega) = 0 \tag{5.3}$$

and

$$Q_\Omega = u. \tag{5.4}$$

The two conditions (5.3) and (5.4), however, are not sufficient to identify a “uniquely acceptable revision of a prior” (Wagner, 1992, 250). Wagner’s proposal includes a third condition, which extends Jeffrey’s rule to the situation at hand. We will call it (W). To articulate the condition, I need some more definitions. For all $E \subseteq \Theta$, let $E^\star = \{\omega \in \Omega : \Gamma(\omega) = E\}$, so that $m(E) = u(E^\star)$. For all $A \subseteq \Theta$ and all $B \subseteq \Omega$, let “$A$” $= A \times \Omega$ and “$B$” $= \Theta \times B$, so that $Q(\text{“}A\text{”}) = Q_\Theta(A)$ for all $A \subseteq \Theta$ and $Q(\text{“}B\text{”}) = Q_\Omega(B)$ for all $B \subseteq \Omega$.
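For the Linguist problem, the definitions (5.1) and (5.2) can be made concrete. The sketch below assumes one reading of example 21 for $\Gamma$: the piety and the epithet both point to a Catholic speaker, the regionalism to a southerner, and the slang expression carries no information. This reading, like the identifiers used, is an assumption of the sketch rather than something spelled out in the example.

```python
from itertools import combinations

THETA = ("theta1", "theta2", "theta3", "theta4")

# Subjective probabilities u over the four candidate utterances.
U = {"omega1": 0.4, "omega2": 0.3, "omega3": 0.2, "omega4": 0.1}

# Assumed reading of Gamma: narrowest event in THETA entailed by each utterance.
GAMMA = {
    "omega1": frozenset({"theta1", "theta2"}),  # Catholic piety
    "omega2": frozenset({"theta1", "theta2"}),  # epithet against Protestants
    "omega3": frozenset({"theta2", "theta4"}),  # southern regionalism
    "omega4": frozenset(THETA),                 # slang used everywhere
}

def m(event):
    """(5.1): total u-mass of utterances whose Gamma is exactly `event`."""
    e = frozenset(event)
    return sum(prob for omega, prob in U.items() if GAMMA[omega] == e)

def b(event):
    """(5.2): total u-mass of utterances whose Gamma is contained in `event`."""
    e = frozenset(event)
    return sum(prob for omega, prob in U.items() if GAMMA[omega] <= e)

def subsets(event):
    """All subsets of `event`, used to check the first equality in (5.2)."""
    items = sorted(event)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

e12 = {"theta1", "theta2"}
print(m(e12), b(e12), b({"theta1", "theta2", "theta4"}), b(THETA))
```

Only three events receive positive mass under this reading: the Catholics, the southerners, and the whole of $\Theta$.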
Let also E = {E ⊆ Θ : m(E) > 0} bethe family of evidentiary focal elements.According to Wagner only those Q satisfying the conditionfor all A ⊆ Θ and for all E ∈ E , Q(“A”|“EF”) = p(A|E) (5.5)1155.2. Wagner’s Natural Generalization of Jeffrey Conditioningare eligible candidates for updated joint probabilities in Linguist type problems.To adopt (5.5), says Wagner, is to make sure that the total impact of the occurrenceof the event EF is to preclude the occurrence of any outcome θ /∈ E, and that, withinE, p remains operative in the assessment of relative uncertainties (see Wagner, 1992,250). While conditions (5.3), (5.4) and (5.5) may admit an infinite number of jointprobability distributions on Θ×Ω, their marginalizations to Θ are identical and giveus the desired posterior probability, expressible by the formulaq(A) =∑E∈Em(E)p(A|E). (5.6)So far I am in agreement with Wagner. Wagner’s scathing verdict about pmetowards the end of his article, however, is not really a verdict about pme in theLaplacean tradition but about the curious conjunction of pme and (I). Wagner neverinvokes (I); it remains an enthymematic assumption in Wagner’s argument.Students of maximum entropy approaches to probability revision may [. . . ] won-der if the probability measure defined by our formula (5.6) similarly minimizes [theKullback-Leibler information number]Dkl(q, p) over all probability measures q boundedbelow by b. The answer is negative [. . . ] convinced by Skyrms, among others, thatpme is not a tenable updating rule, we are undisturbed by this fact. Indeed, we takeit as additional evidence against pme that (5.6), firmly grounded on [. . . ] a consid-ered judgment that (5.5) holds, might violate pme [. . . ] the fact that Jeffrey’s rulecoincides with pme is simply a misleading fluke, put in its proper perspective by thenatural generalization of Jeffrey conditioning described in this paper. [References toformulas and notation modified.] 
(Wagner, 1992, 255.)

In the next section, I will contrast what Wagner considers to be the solution of pme for this problem, ‘Wagner’s pme solution,’ and Wagner’s solution presented in this section, ‘Wagner’s (W) solution,’ and show, in much greater detail than Wagner does, why Wagner’s pme solution misrepresents pme.

5.3 Wagner’s PME Solution

Wagner’s pme solution assumes the constraint that b must act as a lower bound for the posterior probability. Consider E12 = {θ1 ∨ θ2}. Because both ω1 and ω2 entail E12, according to (5.2), b(E12) = 0.70. It makes sense to consider it a constraint that the posterior probability for E12 must be at least b(E12). Then I choose from all probability distributions fulfilling the constraint the one which is closest to the prior probability distribution, using the Kullback-Leibler divergence.

Wagner applies this idea to the marginal probability distribution on Θ. He does not provide the numbers, but refers to simpler examples to make his point that pme does not generally agree with his solution. To aid the discussion, I want to populate Wagner’s claim for the Linguist problem with numbers. Using proposition 1.29 in Dimitri Bertsekas’ book Constrained Optimization and Lagrange Multiplier Methods (see Bertsekas, 1982, 71) and some non-trivial calculations, Wagner’s pme solution for the Linguist problem (indexed Q_wm) is

\[ \tilde{\beta} = (Q_{wm}(\theta_j))^{\top} = (0.30, 0.45, 0.10, 0.15)^{\top}. \tag{5.7} \]

A brief remark about notation: I will use α for vectors expressing ωi probabilities and β for vectors expressing θj probabilities. I will use a tilde as in β̃ or a hat as in β̂ for posteriors, while priors remain without such ornamentation. The tilde is used for Wagner’s pme solution (which, as we will see, is incorrect) and the hat for the correct solution (both (W) and pme).

The cross-entropy between β̃ and the prior

\[ \beta = (P(\theta_j))^{\top} = (0.20, 0.30, 0.40, 0.10)^{\top} \tag{5.8} \]

is indeed significantly smaller than the cross-entropy between Wagner’s (W) solution

\[ \hat{\beta} = (Q(\theta_j))^{\top} = (0.30, 0.60, 0.04, 0.06)^{\top} \tag{5.9} \]

and the prior β (0.0823 compared to 0.4148). For the cross-entropy, I use the Kullback-Leibler Divergence

\[ D_{\mathrm{KL}}(q, p) = \sum_{j} q(\theta_j) \log_2 \frac{q(\theta_j)}{p(\theta_j)}. \tag{5.10} \]

From the perspective of a pme advocate, there are only two explanations for this difference in cross-entropy. Either Wagner’s (W) solution illegitimately uses information not contained in the problem, or Wagner’s pme solution has failed to include information that is contained in the problem. I will simplify the Linguist problem in order to show that the latter is the case.

Example 22: Simplified Linguist. Imagine the native is either Protestant or Catholic (50:50). Further imagine that the utterance of the native either entails that the native is a Protestant (60%) or provides no information about the religious affiliation of the native (40%).

Using (5.6), the posterior probability distribution is 80:20 (Wagner’s (W) solution and, surely, the correct solution), since 0.6 · 1 + 0.4 · 0.5 = 0.8 for Protestant. Using b as a lower bound and pme, Wagner’s pme solution for this radically simplified problem is 60:40 (the distribution closest to the 50:50 prior that respects the lower bound of 0.6 for Protestant), clearly a more entropic solution than Wagner’s (W) solution. The problem, as I will show, is that Wagner’s pme solution does not take into account (L), which a pme advocate would naturally accept.

For a Laplacean, the prior joint probability distribution on Θ × Ω is not left unspecified for the calculation of the posteriors. Before the native makes the utterance, the event space is unspecified with respect to Ω. After the utterance, however, the event space is defined (or brought to attention) and populated by prior probabilities according to (L). That this happens retrospectively may or may not be a problem: Bayes’ theorem is frequently used retrospectively, for example when the anomalous precession of Mercury’s perihelion, discovered in the mid-1800s, was used to confirm Albert Einstein’s General Theory of Relativity in 1915 (for retrospective conditioning see Grove and Halpern, 1997; and Diaconis and Zabell, 1982, 822). I shall bracket for now that this procedure is controversial and refer the reader to the literature on Old Evidence.

Ariel Caticha and Adom Giffin make the following appeal:

Bayes’ theorem requires that P(ω, θ) be defined and that assertions such as ‘ω and θ’ be meaningful; the relevant space is neither Ω nor Θ but the product Ω × Θ [notation modified] (Caticha and Giffin, 2006, 9.)

Following (L) I shall populate the joint probability matrix P on Ω × Θ, which is a matter of synchronic constraints, as updating the joint probability P to Q on Ω × Θ is a matter of diachronic constraints. For the Simplified Linguist problem, this procedure gives us the correct result, agreeing with Wagner’s (W) solution (80:20).

There is a more general theorem which incorporates Wagner’s (W) method into Laplacean realism and pme orthodoxy. The proof of this theorem is in the next section. Its validity is confirmed by how well it works for the Linguist problem (as well as the Simplified Linguist problem).

I have not yet formally demonstrated that for all Wagner-type problems (β, α̂, κ), the correct pme solution (versus Wagner’s deficient pme solution) agrees with Wagner’s (W) solution, although I have established a useful framework and demonstrated the agreement for the Linguist problem. As Vladimír Majerník has shown how to derive marginal probabilities from conditional probabilities using pme (see Majerník, 2000), in the next section I will inversely show how to derive conditional probabilities (i.e. the joint probability matrices) from the marginal probabilities and logical relationships provided in Wagner-type problems.
This technical result together with the claim established in the present paper that Wagner’s intuition (W) is consistent with pme, given (L), underlines the formal and conceptual virtue of pme.

5.4 Maximum Entropy and Probability Kinematics Constrained by Conditionals

Jeffrey conditioning is a method of update (recommended first by Richard Jeffrey in Jeffrey, 1965) which generalizes standard conditioning and operates in probability kinematics where evidence is uncertain (P(E) ≠ 1). Sometimes, when we reason inductively, outcomes that are observed have entailment relationships with partitions of the possibility space that pose challenges that Jeffrey conditioning cannot meet. As we will see, it is not difficult to resolve these challenges by generalizing Jeffrey conditioning. There are claims in the literature that the principle of maximum entropy, from now on pme, conflicts with this generalization. I will show under which conditions this conflict obtains. Since proponents of pme are unlikely to subscribe to these conditions, the position of pme in the larger debate over inductive logic and reasoning is not undermined.

In his paper “Marginal Probability Distribution Determined by the Maximum Entropy Method” (see Majerník, 2000), Vladimír Majerník asks the following question: If I had two partitions of an event space and knew all the conditional probabilities (any conditional probability of one event in the first partition conditional on another event in the second partition), would I be able to calculate the marginal probabilities for the two partitions? The answer is yes, if I commit myself to pme.

For Majerník’s question, pme provides us with a unique and plausible answer. We may also be interested in the obverse question: if the marginal probabilities of the two partitions were given, would we similarly be able to calculate the conditional probabilities? The answer is yes: given pme, Theorems 2.2.1.
and 2.6.5. in Cover and Thomas, 2006 reveal that the joint probabilities are the product of the marginal probabilities (see also Debbah and Müller, 2005). Once the joint probabilities and the marginal probabilities are available, it is trivial to calculate the conditional probabilities.

It is important to note that these joint probabilities do not legislate independence, even though they allow it. Mérouane Debbah and Ralf Müller correctly describe these joint probabilities as a model with as many degrees of freedom as possible, which leaves free degrees for correlation to exist or not (see Debbah and Müller, 2005, 1674).

I will now present Jeffrey conditioning in unusual notation, anticipating using this notation to solve Wagner’s Linguist problem and to give a general solution for the obverse Majerník problem. Let Ω be a finite event space and {θj}j=1,...,n a partition of Ω. Let κ be an m × n matrix for which each column contains exactly one 1, otherwise 0. Let P = P_prior and P̂ = P_posterior. Then {ωi}i=1,...,m, for which

\[ \omega_i = \bigcup_{j=1,\ldots,n} \theta^{*}_{ij}, \tag{5.11} \]

is likewise a partition of Ω (the ω are basically a more coarsely grained partition than the θ). θ*ij = ∅ if κij = 0, θ*ij = θj otherwise. Let β be the vector of prior probabilities for {θj}j=1,...,n (P(θj) = βj) and β̂ the vector of posterior probabilities (P̂(θj) = β̂j); likewise for α and α̂ corresponding to the prior and posterior probabilities for {ωi}i=1,...,m, respectively.

A Jeffrey-type problem is one in which β and α̂ are given and we are looking for β̂. A mathematically more concise characterization of a Jeffrey-type problem is the triple (κ, β, α̂). The solution, using Jeffrey conditioning, is

\[ \hat{\beta}_j = \beta_j \sum_{i=1}^{m} \frac{\kappa_{ij}\,\hat{\alpha}_i}{\sum_{l=1}^{n} \kappa_{il}\,\beta_l} \quad \text{for all } j = 1, \ldots, n. \tag{5.12} \]

The notation is more complicated than it needs to be for Jeffrey conditioning. In section 5, however, I will take full advantage of it to present a generalization where the ωi do not range over the θj.
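The update rule (5.12) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `jeffrey_update` and the numbers are chosen for the sketch (a three-cell partition with priors 1/2, 1/3, 1/6, of which the last two cells form one coarse cell), and the code assumes every ωi has non-zero prior probability.

```python
from fractions import Fraction as F


def jeffrey_update(kappa, beta, alpha_hat):
    """Jeffrey conditioning for a triple (kappa, beta, alpha_hat) as in (5.12).

    kappa[i][j] == 1 exactly when theta_j lies inside omega_i.
    Assumes each omega_i has non-zero prior probability.
    """
    m, n = len(kappa), len(beta)
    # Prior probability of each omega_i: sum of the priors of its cells.
    alpha = [sum(kappa[i][l] * beta[l] for l in range(n)) for i in range(m)]
    # Rescale each theta_j by the posterior-to-prior ratio of its omega_i.
    return [
        beta[j] * sum(kappa[i][j] * alpha_hat[i] / alpha[i] for i in range(m))
        for j in range(n)
    ]


# Illustrative numbers (an assumption of this sketch).
kappa = [[1, 0, 0],
         [0, 1, 1]]
beta = [F(1, 2), F(1, 3), F(1, 6)]
alpha_hat = [F(1, 3), F(2, 3)]

beta_hat = jeffrey_update(kappa, beta, alpha_hat)
print(beta_hat)
```

Exact fractions are used so that the result can be compared with hand calculation: the posterior is (1/3, 4/9, 2/9), and the ratio between the second and third cells is the same before and after the update.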
In the meantime, here is an example to illustrate (5.12).

Example 23: Colour Blind. A token is pulled from a bag containing 3 yellow tokens, 2 blue tokens, and 1 purple token. You are colour blind and cannot distinguish between the blue and the purple token when you see it. When the token is pulled, it is shown to you in poor lighting and then obscured again.

You come to the conclusion based on your observation that the probability that the pulled token is yellow is 1/3 and that the probability that the pulled token is blue or purple is 2/3. What is your updated probability that the pulled token is blue? Let P(blue) be the prior subjective probability that the pulled token is blue and P̂(blue) the respective posterior subjective probability. Jeffrey conditioning, based on jup (which mandates, for example, that P̂(blue | blue or purple) = P(blue | blue or purple)), recommends

\[
\begin{aligned}
\hat{P}(\text{blue}) &= \hat{P}(\text{blue} \mid \text{blue or purple})\,\hat{P}(\text{blue or purple}) \\
&\quad + \hat{P}(\text{blue} \mid \text{neither blue nor purple})\,\hat{P}(\text{neither blue nor purple}) \\
&= P(\text{blue} \mid \text{blue or purple})\,\hat{P}(\text{blue or purple}) = 4/9
\end{aligned}
\tag{5.13}
\]

In the notation of (5.12), example 23 is calculated with β = (1/2, 1/3, 1/6)ᵀ, α̂ = (1/3, 2/3)ᵀ,

\[ \kappa = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix} \tag{5.14} \]

and yields the same result as (5.13) with β̂2 = 4/9.

Wagner’s Linguist problem is an instance of the more general obverse Majerník problem where partitions are given with logical relationships between them as well as some marginal probabilities. Wagner-type problems seek as a solution missing marginals, while obverse Majerník problems seek the conditional probabilities as well, both of which I will eventually provide using pme.

In order to show how pme generalizes Jeffrey conditioning (see subsection 3.2.5) and Wagner conditioning to boot, I use the notation that I have already introduced for Jeffrey conditioning. We can characterize Wagner-type problems analogously to Jeffrey-type problems by a triple (κ, β, α̂).
{θj}j=1,...,n and {ωi}i=1,...,m now refer to independent partitions of Ω, i.e. (5.11) need not be true. Besides the marginal probabilities P(θj) = βj, P̂(θj) = β̂j, P(ωi) = αi, P̂(ωi) = α̂i, we therefore also have joint probabilities µij = P(ωi ∩ θj) and µ̂ij = P̂(ωi ∩ θj).

Given the specific nature of Wagner-type problems, there are a few constraints on the triple (κ, β, α̂). The last row (µmj)j=1,...,n is special because it represents the probability of ωm, which is the negation of the events deemed possible after the observation. In the Linguist problem, for example, ω5 is the event (initially highly likely, but impossible after the observation of the native’s utterance) that the native does not make any of the four utterances. The native may have, after all, uttered a typical Buddhist phrase, asked where the nearest bathroom was, complimented your fedora, or chosen to be silent. κ will have all 1s in the last row. Let κ̂ij = κij for i = 1, . . . , m − 1 and j = 1, . . . , n; and κ̂mj = 0 for j = 1, . . . , n. κ̂ equals κ except that its last row is all 0s, and α̂m = 0. Otherwise the 0s are distributed over κ (and equally over κ̂) so that no row and no column has all 0s, representing the logical relationships between the ωi and the θj (κij = 0 if and only if P̂(ωi ∩ θj) = µij = 0). We set P(ωm) = x (P̂(ωm) = 0), where x depends on the specific prior knowledge. Fortunately, the value of x cancels out nicely and will play no further role. For convenience, I define

\[ \zeta = (0, \ldots, 0, 1)^{\top} \tag{5.15} \]

with ζm = 1 and ζi = 0 for i ≠ m.

The best way to visualize such a problem is by providing the joint probability matrix M = (µij) together with the marginals α and β in the last column/row, here for example as for the Linguist problem with m = 5 and n = 4 (note that this is not the matrix M, which is m × n, but M expanded with the marginals in improper matrix notation):

\[
\left[\begin{array}{cccc|c}
\mu_{11} & \mu_{12} & 0 & 0 & \alpha_1 \\
\mu_{21} & \mu_{22} & 0 & 0 & \alpha_2 \\
0 & \mu_{32} & 0 & \mu_{34} & \alpha_3 \\
\mu_{41} & \mu_{42} & \mu_{43} & \mu_{44} & \alpha_4 \\
\mu_{51} & \mu_{52} & \mu_{53} & \mu_{54} & x \\
\hline
\beta_1 & \beta_2 & \beta_3 & \beta_4 & 1.00
\end{array}\right].
\tag{5.16}
\]

The µij ≠ 0 where κij = 1. Ditto, mutatis mutandis, for M̂, α̂, β̂. To make this a little less abstract, Wagner’s Linguist problem is characterized by the triple (κ, β, α̂),

\[
\kappa = \begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 \\
1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1
\end{bmatrix}
\quad \text{and} \quad
\hat{\kappa} = \begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 \\
1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0
\end{bmatrix}
\tag{5.17}
\]

\[ \beta = (0.2, 0.3, 0.4, 0.1)^{\top} \quad \text{and} \quad \hat{\alpha} = (0.4, 0.3, 0.2, 0.1, 0)^{\top}. \tag{5.18} \]

Wagner’s solution, based on jup, is

\[ \hat{\beta}_j = \beta_j \sum_{i=1}^{m-1} \frac{\hat{\kappa}_{ij}\,\hat{\alpha}_i}{\sum_{\hat{\kappa}_{il}=1} \beta_l} \quad \text{for all } j = 1, \ldots, n. \tag{5.19} \]

In numbers,

\[ \hat{\beta} = (0.3, 0.6, 0.04, 0.06)^{\top}. \tag{5.20} \]

The posterior probability that the native encountered by the linguist is a northerner, for example, is 34%. Wagner’s notation is completely different and never specifies or provides the joint probabilities, but I hope the reader appreciates both the analogy to (5.12) underlined by this notation as well as its efficiency in delivering a correct pme solution for us. The solution that Wagner attributes to pme is misleading because of Wagner’s Dempsterian setup which does not take into account that proponents of pme are likely to be proponents of the classical Bayesian position that type II prior probabilities are specified and determinate once the agent attends to the events in question. Some Bayesians in the current discussion explicitly disavow this requirement for (possibly retrospective) determinacy (especially James Joyce in 2010 and other papers). Proponents of pme (a proper subset of Bayesians), however, are unlikely to follow Joyce; if they did, they would indeed have to address Wagner’s example to show that their allegiances to pme and to indeterminacy are compatible.

That (5.19) follows from jup is well-documented in Wagner’s paper.
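As a check on the arithmetic, formula (5.19) can be evaluated directly for the Linguist triple in (5.17) and (5.18). The following Python sketch is illustrative only: the function name `wagner_update` is an assumption of the sketch, and the code implements Wagner’s rule as stated in (5.19), not the maximum entropy derivation.

```python
def wagner_update(kappa_hat, beta, alpha_hat):
    """Evaluate (5.19): spread each posterior weight alpha_hat[i] over the
    theta_j compatible with omega_i, in proportion to the priors beta[j]."""
    m, n = len(kappa_hat), len(beta)
    beta_hat = [0.0] * n
    for i in range(m):
        if alpha_hat[i] == 0:
            continue  # outcomes ruled out by the observation contribute nothing
        # Prior mass of the theta_l that omega_i leaves open.
        mass = sum(beta[l] for l in range(n) if kappa_hat[i][l] == 1)
        for j in range(n):
            if kappa_hat[i][j] == 1:
                beta_hat[j] += alpha_hat[i] * beta[j] / mass
    return beta_hat


# Linguist problem, numbers taken from (5.17) and (5.18).
kappa_hat = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [0, 1, 0, 1],
             [1, 1, 1, 1],
             [0, 0, 0, 0]]
beta = [0.2, 0.3, 0.4, 0.1]
alpha_hat = [0.4, 0.3, 0.2, 0.1, 0.0]

beta_hat = wagner_update(kappa_hat, beta, alpha_hat)
print([round(b, 4) for b in beta_hat])
```

Up to floating-point rounding, the result is the vector (0.30, 0.60, 0.04, 0.06) reported in (5.20).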
For the pme solution for this problem, I will not use (5.19) or jup, but maximize the entropy for the joint probability matrix M and then minimize the cross-entropy between the prior probability matrix M and the posterior probability matrix M̂. The pme solution, despite its seemingly different ancestry in principle, formal method, and assumptions, agrees with (5.19). This completes our argument.

To maximize the Shannon entropy of M and minimize the Kullback-Leibler divergence between M̂ and M, consider the Lagrangian functions:

\[ \Lambda(\mu_{ij}, \xi) = \sum_{\kappa_{ij}=1} \mu_{ij} \log \mu_{ij} + \sum_{j=1}^{n} \xi_j \Bigl( \beta_j - \sum_{\kappa_{kj}=1} \mu_{kj} \Bigr) + \lambda_m \Bigl( x - \sum_{j=1}^{n} \mu_{mj} \Bigr) \tag{5.21} \]

and

\[ \hat{\Lambda}(\hat{\mu}_{ij}, \hat{\lambda}) = \sum_{\hat{\kappa}_{ij}=1} \hat{\mu}_{ij} \log \frac{\hat{\mu}_{ij}}{\mu_{ij}} + \sum_{i=1}^{m} \hat{\lambda}_i \Bigl( \hat{\alpha}_i - \sum_{\hat{\kappa}_{il}=1} \hat{\mu}_{il} \Bigr). \tag{5.22} \]

For the optimization, I set the partial derivatives to 0, which results in

\[ M = r s^{\top} \circ \kappa \tag{5.23} \]

\[ \hat{M} = \hat{r} s^{\top} \circ \hat{\kappa} \tag{5.24} \]

\[ \beta = S \kappa^{\top} r \tag{5.25} \]

\[ \hat{\alpha} = \hat{R} \kappa s \tag{5.26} \]

where r_i = e^{ζ_i λ_m}, s_j = e^{−1−ξ_j}, r̂_i = e^{−1−λ̂_i} represent factors arising from the Lagrange multiplier method (ζ was defined in (5.15)). The operator ◦ is the entry-wise Hadamard product in linear algebra. r, s, r̂ are the vectors containing the r_i, s_j, r̂_i, respectively. R, S, R̂ are the diagonal matrices with R_il = r_i δ_il, S_kj = s_j δ_kj, R̂_il = r̂_i δ_il (δ is Kronecker delta).

Note that

\[ \frac{\beta_j}{\sum_{\hat{\kappa}_{il}=1} \beta_l} = \frac{s_j}{\sum_{\hat{\kappa}_{il}=1} s_l} \quad \text{for all } (i, j) \in \{1, \ldots, m-1\} \times \{1, \ldots, n\}. \tag{5.27} \]

(5.26) implies

\[ \hat{r}_i = \frac{\hat{\alpha}_i}{\sum_{\hat{\kappa}_{il}=1} s_l} \quad \text{for all } i = 1, \ldots, m-1. \tag{5.28} \]

Consequently,

\[ \hat{\beta}_j = s_j \sum_{i=1}^{m-1} \frac{\hat{\kappa}_{ij}\,\hat{\alpha}_i}{\sum_{\kappa_{il}=1} s_l} \quad \text{for all } j = 1, \ldots, n. \tag{5.29} \]

(5.29) gives us the same solution as (5.19), taking into account (5.27). Therefore, Wagner conditioning and pme agree.

Wagner-type problems (but not obverse Majerník-type problems) can be solved using jup and Wagner’s ad hoc method. Obverse Majerník-type problems, and therefore all Wagner-type problems, can also be solved using pme and its established and integrated formal method.
What at first blush looks like serendipitous coincidence, namely that the two approaches deliver the same result, reveals that jup is safely incorporated in pme. Not to gain information where such information gain is unwarranted and to process all the available and relevant information is the intuition at the foundation of pme. My results show that this more fundamental intuition generalizes the more specific intuition that ratios of probabilities should remain constant unless they are affected by observation or evidence. Wagner’s argument that pme conflicts with jup is ineffective because it rests on assumptions that proponents of pme naturally reject.

Chapter 6
A Problem for Indeterminate Credal States

6.1 Booleans and Laplaceans

The claim defended in this chapter is that rational agents are subject to a norm requiring sharp credences. I defend this claim in spite of the initially promising features of indeterminate credal states to address problems that sharp credences face in reflecting an agent’s doxastic state.

Traditionally, Bayesians have maintained that a rational agent, when holding a credence, holds a sharp credence. It has recently become popular to drop the requirement for credence functions to be sharp. There are now Bayesians who permit a rational agent to hold indeterminate credal states based on incomplete or ambiguous evidence. I will refer to Bayesians who continue to adhere to the classical theory of sharp credences for rational agents as ‘Laplaceans’ (e.g. Adam Elga and Roger White). I will refer to Bayesians who do not believe that a rational agent’s credences need to be sharp as ‘Booleans’ (e.g. Richard Jeffrey, Peter Walley, Brian Weatherson, and James Joyce). Historically, George Boole and Pierre-Simon Laplace have held positions vaguely related to the ones discussed here (see, for example, Boole, 1854, chapters 16–21).

There is some terminological confusion around the adjectives ‘imprecise,’ ‘indeterminate,’ and ‘mushy’ credences.
In the following, I will exclusively refer to indeterminate credences or credal states (abbreviated ‘instates’) and mean by them a set of sharp credence functions (which some Booleans require to be convex) which it may be rational for an agent to hold within an otherwise orthodox Bayesian framework (see Jeffrey, 1983).

More formally speaking, let Ω be a set of state descriptions or possible worlds and X a suitable algebra on Ω. Call C_X, which is a set of probability functions on X, a credal state with respect to X. Sometimes the credal state is required to be convex so that P1 ∈ C_X and P3 ∈ C_X imply P2 ∈ C_X if P2 = ϑP1 + (1 − ϑ)P3. ϑ is a scalar between 0 and 1, which is multiplied by a probability function using conventional scalar multiplication.

The credal state is potentially different from an agent’s doxastic state, which can be characterized in more detail than the credal state (examples will follow). The doxastic state of a rational agent contains all the information necessary to update, infer, and make decisions. Since updating, inference, and decision-making generally need the quantitative information in a credal state, the credal state is a substate of the doxastic state. Credal states group together doxastic states which are indistinguishable on their formal representation by C_X. Laplaceans require that the cardinality of C_X is 1. Booleans have a less rigid requirement of good behaviour for C_X, for example they may require it to be a Borel set on the associated vector space if Ω is finite. C_X is a set restricted to probability functions for both Laplaceans and Booleans because both groups are Bayesian.

In the following, I will sometimes say that a credal state with respect to a proposition is a sub-interval of the unit interval, for example (1/3, 2/3). This is a loose way of speaking, since credal states are sets of probability functions, not set-valued functions on a domain of propositions.
What I mean, then, is that the credal state identifies (1/3, 2/3) as the range of values that the probability functions in C_X take when they are applied to the proposition in question.

In this chapter, I will introduce a divide et impera argument in favour of the Laplacean position. I assume that the appeal of the Boolean position is immediately obvious. Not only is it psychologically implausible that agents who strive for rationality should have all their credences worked out to crystal-clear precision; it seems epistemically doubtful to assign exact credences to propositions about which the agent has little or no information, incomplete or ambiguous evidence (Joyce calls the requirement for sharp credences “psychologically implausible and epistemologically calamitous,” see Joyce, 2005, 156). From the perspective of information theory, it appears that an agent with sharp credences pretends to be in possession of information that she does not have.

The divide et impera argument runs like this: I show that the Boolean position is really divided into two mutually incompatible camps, Bool-A and Bool-B. Bool-A has the advantage of owning all the assets of appeal against the Laplacean position. The stock examples brought to bear against sharp credences have simple and compelling instate solutions given Bool-A. Bool-A, however, has deep conceptual problems which I will describe in detail below.

Bool-B’s refinement of Bool-A is successful in so far as it resolves the conceptual problems. Their success depends on what I call Augustin’s concessions, which undermine all the appeal that the Boolean position as a whole has over the Laplacean position.
With a series of examples, I seek to demonstrate that in Simpson Paradox type fashion the Laplacean position looks genuinely inferior to the amalgamated Boolean position, but as soon as the mutually incompatible strands of the Boolean position have been identified, the Laplacean position is independently superior to both.

6.2 Partial Beliefs

Bayesians, whether Booleans or Laplaceans, agree that full belief epistemology gives us an incomplete account of rationality and the epistemic landscape of the human mind. Full belief epistemology is concerned with the acceptance and the rejection of full beliefs, whether an agent may be in possession of knowledge about their contents and what may justify or constitute this knowledge. Bayesians engage in a complementary project investigating partial beliefs. There are belief contents toward which a rational agent has a belief-like attitude characterized by degrees of confidence. These partial beliefs are especially useful in decision theory (for example, betting scenarios). Bayesians have developed a logic of partial beliefs, not dissimilar to traditional logic, which justifies certain partial beliefs in the light of other partial beliefs.

Some epistemologists now seek to reconcile full and partial belief epistemology (see Spohn, 2012; Weatherson, 2012; and Moss, 2013). There is a sense in which, by linking knowledge of chances to its representation in credences, Booleans also seek to reconcile traditional knowledge epistemology concerned with full belief and Bayesian epistemology concerned with partial belief. If the claims in this chapter are correct then the Boolean approach will not contribute to this reconciliation because it mixes full belief and partial belief metaphors in ways that are problematic. The primary task of Bayesians is to explain what partial beliefs are, how they work, and what the norms of rationality are that govern them.
The problem for Booleans, as we will see, is that only Bool-A has an explanation at hand for how partial beliefs model the epistemic state of the agent in tandem with the way full beliefs do their modeling: as we will see by example, Bool-A often looks at partial beliefs as full beliefs about objective chances. Bool-B debunks this approach, which leaves the project of reconciliation unresolved.

For the remainder of this section, I want to give the reader a flavour of how appealing the amalgamated version of Booleanism is (the view that a rational agent is permitted to entertain credal states that lack the precision of sharp credences) and then draw the distinction between Bool-A and Bool-B. Many recently published papers confess allegiance to allowing instates without much awareness of Augustin’s and Joyce’s refinements in what I call Bool-B (for examples see Kaplan, 2010; Hájek and Smithson, 2012; Moss, 2013; Chandler, 2014; and Weisberg, forthcoming). The superiority of the Boolean approach over the Laplacean approach is usually packaged as the superiority of the amalgamated Boolean version, even when the advantages almost exclusively belong to Bool-A. Advocates of Bool-B, when they defend instates against the Laplacean position, often relapse into Bool-A-type argumentation, as we will see by example in a moment.

When we first hear of the advantages of instates, three of them sound particularly persuasive.

• intern Instates represent the possibility range for objective chances (objective chances internal to the instate are not believed not to hold, objective chances external to the instate are believed not to hold).
• incomp Instates represent incompleteness or ambiguity of the evidence.
• inform Instates are responsive to the information content of evidence.

Here are some examples and explanations. Let a coin x be a Bernoulli generator that produces successes and failures with probability p_x for success, labeled H_x, and 1 − p_x for failure, labeled T_x.
Physical coins may serve as Bernoulli generators, if we are willing to set aside that most of them are approximately fair.

Example 24: INTERN. Blake has two Bernoulli generators in her lab, coin i and coin ii. Blake has a database of coin i results and concludes on excellent evidence that coin i is fair. Blake has no evidence about the bias of coin ii. As a Boolean, Blake assumes a sharp credence of {0.5} for H_i and an indeterminate credal state of [0, 1] for H_ii. She feels bad for Logan, her Laplacean colleague, who cannot distinguish between the two cases and who must assign a sharp credence to both H_i and H_ii (for example, {0.5}).

Example 25: INCOMP. Blake has another Bernoulli generator, coin iii, in her lab. Her graduate student has submitted coin iii to countless experiments and emails Blake the resulting bias, but fails to include whether the bias of 2/3 is in favour of H_iii or in favour of T_iii. As a Boolean, Blake assumes an indeterminate credal state of [1/3, 2/3] (or {1/3, 2/3}, depending on the convexity requirement) for H_iii. She feels bad for Logan who must assign a sharp credence to H_iii. If Logan chooses {0.5} as her sharp credence based not unreasonably on symmetry considerations, Logan concurrently knows that her credence gets the bias wrong.

Example 24 also serves as an example for inform: one way in which Blake feels bad for Logan is that Logan’s {0.5} credence for H_ii is based on very little information, a fact not reflected in Logan’s credence. Walley notes that “the precision of probability models should match the amount of information on which they are based” (Walley, 1991, 34). Joyce explicitly criticizes the information overload for sharp credences in examples such as example 24.
He says about sharp credences ofthis kind that, despite their maximal entropy compared to other sharp credences,they are “very informative” and “adopting [them] amounts to pretending that youhave lots and lots of information that you simply don’t have” (Joyce, 2010, 284).Walley and Joyce appeal to intuition when they promote inform. It just feelsas if there were more information in a sharp credence than in an instate. Neither ofthem ever makes this claim more explicit. Joyce admits:It is not clear how such an ‘imprecise minimum information requirement’ might beformulated, but it seems clear that C1 encodes more information than C2 wheneverC1 ⊂ C2, or when C2 arises from C1 by conditioning. (Joyce, 2010, 288.)Since for the rest of the chapter the emphasis will be on intern and incomp, Iwill advance my argument against inform right away: not only is it not clear howJoyce’s imprecise minimum information requirement might be formulated, I see noreason why it should give the results that Joyce envisions. To compare instates andsharp credences informationally, we would need a non-additive set function obeyingShannon’s axioms for information (for an excellent summary see Klir, 2006). At-tempts for such a generalized Shannon measure have been made, but they are allunsatisfactory. George Klir lists the requirements on page 235 (loc. cit.), and theyare worth calling to mind here (a similar list is in Mork, 2013, 363):(S1) Probability Consistency When D contains only one probability distribution,S assumes the form of the Shannon entropy.(S2) Set Consistency When D consists of the set of all possible probability distri-butions on A ⊆ X, then S(D) = log2 |A|.(S3) Range The range of S is [0, log2 |X|] provided that uncertainty is measured inbits.(S4) Subadditivity If D is an arbitrary convex set of probability distributions onX × Y and DX and DY are the associated sets of marginal probability distri-butions on X and Y , respectively, then S(D) ≤ S(DX) + S(DY ).1336.2. 
Partial Beliefs(S5) Additivity If D is the set of joint probability distributions on X × Y thatis associated with independent marginal sets of probability distributions, DXand DY , which means that D is the convex hull of the set DX ⊗ DY , thenS(D) = S(DX) + S(DY ).Several things need an explanation here (I have retained Klir’s nomenclature).D is the set of probability distributions constituting the instate. S is the proposedgeneralized Shannon measure defined on the set of possible instates. I will give anexample below in (6.2). X is the event space. DX ⊗DY is defined as follows:DX ⊗DY = {p(x, y) = pX(x) · pY (y)|x ∈ X, y ∈ Y, pX ∈ DX , pY ∈ DY }. (6.1)One important requirement not listed is that indeterminateness should give ushigher entropy (otherwise Joyce’s and Walley’s argument will fall flat). Klir’s mosthopeful contender for a generalized Shannon measure (see his equation 6.61) doesnot fulfill this requirement:S(D) = maxp∈D{−∑x∈Xp(x) log2 p(x)}. (6.2)Notice that for any convex instate there is a sharp credence contained in theinstate whose generalized Shannon measure according to (6.2) equals the generalizedShannon measure of the instate, but we would expect the entropy of the sharpcredence to be lower than the entropy of the instate if it is indeterminate. JonasClausen Mork has noticed this problem as well and proposes a modified measure toreflect that more indeterminacy ought to mean higher entropy, all else being equal.He adds the following requirements to Klir’s list above (the labels are mine; Morkcalls them NC1 and NC2 in Mork, 2013, 363):(S6) Weak Monotonicity If P is a superset of P ′, then U(P ) ≥ U(P ′). A setcontaining another has at least as great uncertainty value.1346.2. 
(S7) Strong Monotonicity If (i) the lower envelope of P is dominated strongly by the lower envelope of P′ and (ii) P is a strict superset of P′, then U(P) > U(P′). When one set strictly contains another with a strictly higher lower bound for at least one hypothesis, the greater set has strictly higher uncertainty value.

I have retained Mork’s nomenclature and trust that the reader can see how it lines up with Klir’s. Klir’s generalized Shannon measure (6.2) fulfills weak monotonicity, but violates strong monotonicity. Mork’s proposed alternative is the following (see Mork, 2013, 364):

\[ CSU(\Pi, P) = \max_{p^* \in P} \Big\{ -\sum_{i=1}^{n} p^*(h_i) \min_{q^* \in P} \big\{ \log_2 q^*(h_i) \big\} \Big\}. \tag{6.3} \]

Mork fails to establish subadditivity, however, and it is more fundamentally unclear if Klir has not already shown with his disaggregation theory that fulfilling all desiderata (S1)–(S7) is impossible. Some of Klir’s remarks seem to suggest this (see, for example, Klir, 2006, 218), but I was unable to discern a full-fledged impossibility theorem in his account. This would be an interesting avenue for further research.

Against the force of intern, incomp, and inform, I maintain that the Laplacean approach of assigning subjective probabilities to partitions of the event space (e.g. objective chances) and then aggregating them by David Lewis’ summation formula (see Lewis, 1981, 266f) into a single precise credence function is conceptually tidy and shares many of the formal virtues of Boolean theories. If the bad taste about numerical precision lingers, I will point to philosophical projects in other domains where the concepts we use are sharply bounded, even though our ability to conceive of those sharp boundaries or know them is limited (in particular Timothy Williamson’s accounts of vagueness and knowledge).
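As a numerical aside on the two generalized measures (6.2) and (6.3), the following sketch evaluates both on a toy case. It is only an illustration under stated assumptions: the credal states are hypothetical, finite lists of distributions stand in for convex instates, every member assigns positive probability to each hypothesis, and the function names are mine rather than Klir's or Mork's.

```python
import math

def shannon(p):
    """Shannon entropy in bits of a single distribution (a tuple of probabilities)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def klir_measure(instate):
    """Measure (6.2): the largest Shannon entropy among the members of the set."""
    return max(shannon(p) for p in instate)

def mork_measure(instate):
    """Measure (6.3); assumes every member gives each hypothesis positive probability."""
    n = len(instate[0])
    # Lowest log-probability that any member assigns to each hypothesis h_i.
    min_log = [min(math.log2(q[i]) for q in instate) for i in range(n)]
    return max(-sum(p[i] * min_log[i] for i in range(n)) for p in instate)

# Hypothetical credal states about one coin toss, written as (heads, tails).
sharp = [(0.5, 0.5)]
instate = [(0.3, 0.7), (0.5, 0.5), (0.7, 0.3)]

klir_sharp, klir_instate = klir_measure(sharp), klir_measure(instate)
mork_sharp, mork_instate = mork_measure(sharp), mork_measure(instate)

print("Klir (6.2):", klir_sharp, klir_instate)
print("Mork (6.3):", mork_sharp, mork_instate)
```

In this toy case (6.2) cannot tell the sharp credence from the instate that contains it, whereas (6.3) assigns the wider set a strictly larger value.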
To put it provocatively, this chapter defends a 0.5 sharp credence in heads in all three cases: for a coin of whose bias we are completely ignorant; for a coin whose fairness is supported by a lot of evidence; and even for a coin about whose bias we know that it is either 1/3 or 2/3 for heads.

Statements by Levi and Joyce are representative of how the Boolean position is most commonly motivated:

A refusal to make a determinate probability judgment does not derive from a lack of clarity about one’s credal state. To the contrary, it may derive from a very clear and cool judgment that on the basis of the available evidence, making a numerically determinate judgment would be unwarranted and arbitrary. (Levi, 1985, 395.)

As sophisticated Bayesians like Isaac Levi (1980), Richard Jeffrey (1983), Mark Kaplan (1996), have long recognized, the proper response to symmetrically ambiguous or incomplete evidence is not to assign probabilities symmetrically, but to refrain from assigning precise probabilities at all. Indefiniteness in the evidence is reflected not in the values of any single credence function, but in the spread of values across the family of all credence functions that the evidence does not exclude. This is why modern Bayesians represent credal states using sets of credence functions. It is not just that sharp degrees of belief are psychologically unrealistic (though they are). Imprecise credences have a clear epistemological motivation: they are the proper response to unspecific evidence.
(Joyce, 2005, 170f.)

Consider therefore the following reasons that incline Booleans to permit instates for rational agents:

(A) The greatest emphasis motivating indeterminacy rests on intern, incomp, and inform.

(B) The preference structure of a rational agent may be incomplete so that representation theorems do not yield single probability measures to represent such incomplete structures.

(C) There are more technical and paper-specific reasons, such as Thomas Augustin’s attempt to mediate between the minimax pessimism of objectivists and the Bayesian optimism of subjectivists using interval probability (see Augustin, 2003, 35f); Alan Hájek and Michael Smithson’s belief that there may be objectively indeterminate chances in the physical world (see Hájek and Smithson, 2012, 33, but also Hájek, 2003, 278, 307); Jake Chandler’s claim that “the sharp model is at odds with a trio of plausible propositions regarding agnosticism” (Chandler, 2014, 4); and Brian Weatherson’s claim that for the Boolean position, open-mindedness and modesty may be consistent when for the Laplacean they are not (see Weatherson, 2015, using a result by Gordon Belot, see Belot, 2013).

This chapter mostly addresses (A), while taking (B) seriously as well and pointing towards solutions for it. I am leaving (C) for more specific responses to the issues presented in the cited articles. I will address in section 6.6 Weatherson’s more general claim that it is a distinctive and problematic feature of the Laplacean position “that it doesn’t really have a good way of representing a state of indecisiveness or open-mindedness” (Weatherson, 2015, 9), i.e. that sharp credences cannot fulfill what I will call the double task (see section 6.6).
Weatherson’s more particular claim about open-mindedness and modesty is a different story and shall be told elsewhere.

6.3 Two Camps: Bool-A and Bool-B

My divide et impera argument rests on the distinction between two Boolean positions. The difference is best captured by a simple example to show how epistemologists advocate for Bool-A or relapse into it, even when they have just advocated the more refined Bool-B.

Example 26: Skittles. Every skittles bag contains 42 pieces of candy. It is filled by robots from a giant randomized pile of candies in a warehouse, where the ratio of five colours is 8:8:8:9:9, orange being the last of the five colours. Logan picks one skittle from a bag and tries to guess what colour it is before she looks at it. She has a sharp credence of 9/42 that the skittle is orange.

Bool-A Booleans reject Logan’s sharp credence on the basis that she does not know that there are 9 orange skittles in her particular bag. A 9/42 credence suggests to them a knowledge claim on Logan’s part, based on very thin evidence, that her bag contains 9 orange skittles (if this example is not persuasive because the process by which the skittles are chosen is so well known that even a Bool-A Boolean should have a sharp credence, the reader is welcome to complicate it to her taste; Bradley and Steele’s example is similarly simplistic, see Bradley and Steele, 2016, 6). Logan’s doxastic state, however, is much more complicated than her credal state. She knows about the robots and the warehouse. Therefore, her credences that there are k orange skittles in the bag conform to the binomial distribution:

\[ C(k) = \binom{42}{k} \left(\frac{9}{42}\right)^{k} \left(\frac{33}{42}\right)^{42-k} \tag{6.4} \]

For instance, her sharp credence that there are in fact 9 orange skittles in the bag is approximately 14.9%. One of Augustin’s concessions, the refinements that Bool-B makes to Bool-A, clarifies that a coherent Boolean position must agree with the Laplacean position that doxastic states are not fully captured by credal states. I will
I willsee in the next section why this is the case.It is a characteristic of Bool-A, however, to require that the credal state be suf-ficient for inference, updating, and decision making. Susanna Rinard, for example,considers it the goal of instates to provide “a complete characterization of one’s dox-astic attitude” (Rinard, 2015, 5) and reiterates a few pages later that it is “a primarypurpose of the set of functions model to represent the totality of the agent’s actualdoxastic state” (Rinard, 2015, 12).I will give a few examples of Bool-A in the literature, where the authors usu-ally consider themselves to be defending an amalgamated Boolean position whichis pro toto superior to the Laplacean position. Here is an illustration in Ha´jek andSmithson.Example 27: Lung Cancer. Your doctor is your sole source of information aboutmedical matters, and she assigns a credence of [0.4, 0.6] to your getting lung cancer.Ha´jek and Smithson go on to say that1386.3. Two Camps: Bool-A and Bool-Bit would be odd, and arguably irrational, for you to assign this proposition a sharpercredence—say, 0.5381. How would you defend that assignment? You could say, Idon’t have to defend it, it just happens to be my credence. But that seems about asunprincipled as looking at your sole source of information about the time, your digitalclock, which tells that the time rounded off to the nearest minute is 4:03—and yetbelieving that the time is in fact 4:03 and 36 seconds. Granted, you may just happento believe that; the point is that you have no business doing so. (Ha´jek and Smithson,2012, 38f.)This is an argument against Laplaceans by Bool-A because it conflates partialbelief and full belief. The precise credences in Ha´jek and Smithson’s example, on anyreasonable Laplacean interpretation, do not represent full beliefs that the objectivechance of getting lung cancer is 0.5381 or that the time of the day is 4:03:36. 
A sharp credence rejects no hypothesis about objective chances (unlike an instate for Bool-A). It often has a subjective probability distribution operating in the background, over which it integrates to yield the sharp credence (it would do likewise in Hájek and Smithson’s example for the prognosis of the doctor or the time of the day). The integration proceeds by Lewis’ summation formula (see Lewis, 1981, 266f),

\[ C(R) = \int_{0}^{1} \zeta \, P(\pi(R) = \zeta) \, d\zeta. \tag{6.5} \]

If, for example, S is the proposition that Logan’s randomly drawn skittle in example 26 is orange, then

\[ C(S) = \sum_{k=0}^{42} \frac{k}{42} \binom{42}{k} \left(\frac{9}{42}\right)^{k} \left(\frac{33}{42}\right)^{42-k} = 9/42. \tag{6.6} \]

No objective chance π(S) needs to be excluded by it. Any updating will merely change the partial beliefs, but no full beliefs. Instates, by giving ranges of acceptable objective chances, suggest that there is a full belief that the objective chance does not lie outside what is indicated by the instate (corresponding to intern). When a Bool-A advocate learns that an objective chance lies outside her instate, she needs to resort to belief revision rather than updating her partial beliefs. A Bool-B advocate can avoid this by accepting one of Augustin’s concessions that I will introduce in section 6.5.

Here is another quote revealing the position of Bool-A, this time by Kaplan. The example that Kaplan gives is in all relevant respects like example 26, except that he contrasts two cases, one in which the composition of an urn is known and the other where it is not (as the composition of Logan’s skittles bag is not known).

Consider the two cases we considered earlier, and how the difference between them bears on the question as to how confident you should be that (B) the ball drawn will be black. In the first case [where you know the composition of the urn, 100 balls, 50 of which are black], it is clear why you should have a degree of confidence equal to 0.5 that ball drawn from the urn will be black.
Your evidence tells you that there is an objective probability of 0.5 that the ball will be black: it rules every other assignment out either as too low or as too high. In the second case [you only know that there are between 30 and 65 black balls], however, you do not know the objective probability that the ball will be black, because you don’t know exactly how many of the balls in the urn are black. Your evidence, thus much inferior in quality to the evidence you have in the first case, doesn’t rule out all the assignments your evidence in the first case does. It rules out, as less warranted than the rest, every assignment that gives B a value < 0.3, and every assignment that gives B a value > 0.65. But none of the remaining assignments can reasonably thought to be any more warranted, or less warranted, by your evidence than any other. But then it would seem, at least at first blush, an exercise in unwarranted precision to accede to the requirement, issued by Orthodox Bayesian Probabilism [the Laplacean position], that you choose one of those assignments to be your own. (Kaplan, 2010, 43f.)

While most of these examples have presented Bool-A as an amalgamated Boolean position, without heed to the refinements of Augustin and Joyce, it is also the case that refined Booleans belonging to Bool-B relapse into Bool-A patterns when they argue against the Laplacean position. This is not surprising, because, as we will see, the refined Boolean position Bool-B is more coherent than the simpler Boolean position Bool-A, but also left without resources to address the problems that have made the Laplacean position vulnerable in the first place. Joyce, for instance, refers to an example that is again in all relevant respects like example 26 and states that Logan is “committing herself to a definite view about the relative proportions of skittles in the bag” (see Joyce, 2010, 287, pronouns and example-specific nouns changed to fit example 26).
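The numbers behind example 26 can be checked directly. The sketch below is a minimal illustration, assuming only the binomial credences of (6.4) and the summation of (6.6); the variable and function names are my own. It computes Logan's credence that her bag holds exactly 9 orange skittles and her credence that the drawn skittle is orange.

```python
from fractions import Fraction
from math import comb

N = 42                      # skittles per bag
P_ORANGE = Fraction(9, 42)  # share of orange candies in the warehouse pile

def credence_k_orange(k):
    """Binomial credence (6.4) that the bag holds exactly k orange skittles."""
    return comb(N, k) * P_ORANGE**k * (1 - P_ORANGE)**(N - k)

# Credence that the bag contains exactly 9 orange skittles.
credence_nine = credence_k_orange(9)

# Summation (6.6): weight each possible chance k/42 by the credence in it.
credence_orange = sum(Fraction(k, N) * credence_k_orange(k) for k in range(N + 1))

print(float(credence_nine))   # roughly 0.149
print(credence_orange)        # 3/14, which is 9/42 in lowest terms
```

The sharp credence of 9/42 thus sits alongside a credence of only about 15% that the bag really contains 9 orange skittles, so holding the former does not amount to a definite view about the composition of the bag.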
Augustin defends the Boolean position with another example of relapse:

Imprecise probabilities and related concepts . . . provide a powerful language which is able to reflect the partial nature of the knowledge suitably and to express the amount of ambiguity adequately. (Augustin, 2003, 34.)

Augustin himself (see section 6.5 on Augustin’s concessions) details the demise of the idea that indeterminate credal states can “express the amount of ambiguity adequately.” Before I go into these details, however, I need to make the case that Augustin’s concessions are necessary in order to refine Bool-A and make it more coherent. Two problems for instates make this case for us: dilation and the impossibility of learning. Note that these problems are not sufficient to reject instates; they only compel us to refine the simpler Boolean position via Augustin’s concessions. The final game is between Bool-B and Laplaceans, where I will argue that the Bool-B position has lost the intuitive appeal of the amalgamated Boolean position to present solutions to prima facie problems facing the Laplacean position.

6.4 Dilation and Learning

Here are two potential problems for Booleans:

• dilation Instates are vulnerable to dilation.

• obtuse Instates do not permit learning.

Both of these can be resolved by making Augustin’s concessions. I will introduce these problems in the present section 6.4, then Augustin’s concessions in the next section 6.5, and the implications for the more general disagreement between Booleans and Laplaceans in section 6.6.

6.4.1 Dilation

Consider the following example for dilation (see White, 2010, 175f and Joyce, 2010, 296f).

Example 28: Dilation. Logan has two Bernoulli generators, coin_iv and coin_v. She has excellent evidence that coin_iv is fair and no evidence about the bias of coin_v. Logan’s graduate student independently tosses both coin_iv and coin_v.
Then she tells Logan whether the results of the two tosses correspond or not (H_iv ≡ H_v or H_iv ≡ T_v, where X ≡ Y means (X ∧ Y) ∨ (¬X ∧ ¬Y)). Logan, who has a sharp credence for H_v, takes this information in stride, but she feels bad for Blake, whose credence in H_iv dilates to [0, 1] even though Blake shares Logan’s excellent evidence that coin_iv is fair.

Here is why Blake’s credence in H_iv must dilate. Her credence in H_v is [0, 1], by stipulation. Let c(X) be the range of probabilities represented by Blake’s instate with respect to the proposition X, for example c(H_v) = [0, 1]. Then

\[ c(H_{iv} \equiv H_v) = c(H_{iv} \equiv T_v) = \{0.5\} \tag{6.7} \]

because the tosses are independent and c(H_iv) = {0.5} by stipulation. Next,

\[ c(H_{iv} \mid H_{iv} \equiv H_v) = c(H_v \mid H_{iv} \equiv H_v) \tag{6.8} \]

where c(X|Y) is the updated instate after finding out Y. Booleans accept (6.8) because they are Bayesians and update by standard conditioning. Therefore,

\[ c(H_{iv} \mid H_{iv} \equiv H_v) = c(H_v \mid H_{iv} \equiv H_v) = c(H_v) = [0, 1]. \tag{6.9} \]

To see that (6.9) is true, note that in the rigorous definition of a credal state as a set of probability functions, each probability function P in Blake’s instate for the proposition H_v | H_iv ≡ H_v has the following property by Bayes’ theorem:

\[ P(H_v \mid H_{iv} \equiv H_v) = \frac{P(H_{iv}) P(H_v)}{P(H_{iv}) P(H_v) + P(T_{iv}) P(T_v)} \tag{6.10} \]

In my loose way of speaking of credal states as set-valued functions of propositions, Blake’s updated instate for H_iv has dilated from {0.5} to [0, 1] (for another example of dilation with discussion see Bradley and Steele, 2016).

This does not sound like a knock-down argument against Booleans (it is investigated in detail in Seidenfeld and Wasserman, 1993), but Roger White uses it to derive implications from instates which are worrisome.

Example 29: Chocolates. Four out of five chocolates in the box have cherry fillings, while the rest have caramel. Picking one at random, what should my credence be that it is cherry-filled? Everyone, including the staunchest [Booleans], seems to agree on the answer 4/5.
Now of course the chocolate I’ve chosen has many other features, for example this one is circular with a swirl on top. Noticing such features could hardly make a difference to my reasonable credence that it is cherry filled (unless of course I have some information regarding the relation between chocolate shapes and fillings). Often chocolate fillings do correlate with their shapes, but I haven’t the faintest clue how they do in this case or any reason to suppose they correlate one way rather than another . . . the further result is that while my credence that the chosen chocolate is cherry-filled should be 4/5 prior to viewing it, once I see its shape (whatever shape it happens to be) my credence that it is cherry-filled should dilate to become [indeterminate]. But this is just not the way we think about such matters. (White, 2010, 183.)

I will characterize the problems that dilation causes for the Boolean position by three aspects (all of which originate in White, 2010):

• retention This is the problem in example 29. When we tie the outcome of a random process whose objective chance we know to the outcome of a random process whose chance we do not know, White maintains that we should be entitled to retain the credence that is warranted by the known objective chance. One worry about the Boolean position in this context is that credences become excessively dependent on the mode of representation of a problem (Howson calls this the description-relativity worry).

• repetition Consider again example 28, although this time Blake runs the experiment 10,000 times. Each time, her graduate student tells her whether H^n_iv ≡ H^n_v or H^n_iv ≡ T^n_v, n signifying the n-th experiment. After running the experiment that many times, approximately half of the outcomes of coin_iv are heads. Now Blake runs the experiment one more time.
Again, the Boolean position mandates dilation, but should Blake not just on inductive grounds persist in a sharp credence of 0.5 for H_iv, given that about half of the coin flips so far have come up heads? Attentive readers will notice a sleight of hand here: as many times as Blake performs the experiment, it must not have any evidential impact on the assumptions of example 28, especially that the two coin flips remain independent and that the credal state for coin_v is still, even after all these experiments, maximally indeterminate.

• reflection Consider again example 28. Blake’s graduate student will tell Blake either H_iv ≡ H_v or H_iv ≡ T_v. No matter what the graduate student tells Blake, Blake’s credence in H_iv dilates to an instate of [0, 1]. Blake therefore is subject to Bas van Fraassen’s reflection principle stating that

\[ P^a_t(A \mid p^a_{t+x}(A) = r) = r \tag{6.11} \]

where P^a_t is the agent a’s credence function at time t, x is any non-negative number, and p^a_{t+x}(A) = r is the proposition that at time t+x, the agent a will bestow degree r of credence on the proposition A (see van Fraassen, 1984, 244). Van Fraassen had sharp credences in mind, but it is not immediately obvious why the reflection principle should not also hold for instates.

To address the force of retention, repetition, and reflection, Joyce hammers home what I have called Augustin’s concessions. According to Joyce, none of White’s attacks succeed if a refinement of Bool-A’s position takes place and instates are not required either to reflect knowledge of objective chances or doxastic states. I will address this in detail in section 6.5, concluding that Augustin’s concessions are only necessary based on reflection, whereas solutions for retention and repetition have less far-reaching consequences and do not impugn the Boolean position of Bool-A.

6.4.2 Learning

Here is an example for obtuse (see Rinard’s objection cited in White, 2010, 84 and addressed in Joyce, 2010, 290f).
It presumes Joyce’s supervaluationist semantics of instates (see Hájek, 2003; Joyce, 2010, 288; and Rinard, 2015; for a problem with supervaluationist semantics see Lewis, 1993; and for an alternative which may solve the problem see Weatherson, 2015, 7), for which Joyce uses the helpful metaphor of committee members, each of whom holds a sharp credence. The instate consists then of the set of sharp credences from each committee member: for the purposes of updating, for example, each committee member updates as if she were holding a sharp credence. The aggregate of the committee members’ updated sharp credences forms the updated instate. Supervaluationist semantics also permits comparisons, when for example a partial belief in X is stronger than a partial belief in Y because all committee members have sharp credences in X which exceed all the sharp credences held by committee members with respect to Y.

Example 30: Learning. Blake has a Bernoulli generator in her lab, coin_vi, of whose bias she knows nothing and which she submits to experiments. At first, Blake’s instate for H_vi is (0, 1). After a few experiments, it looks like coin_vi is fair. However, as committee members crowd into the centre and update their sharp credences to something closer to 0.5, they are replaced by extremists on the fringes. The instate remains at (0, 1).

It is time now to examine refinements of the Boolean position to address these problems.

6.5 Augustin’s Concessions

Joyce has defended instates against dilation and obtuse, making Augustin’s concessions (AC1) and (AC2). I am naming them after Thomas Augustin, who has some priority over Joyce in the matter. Augustin’s concessions distinguish Bool-A and Bool-B, the former of which does not make the concessions and identifies partial beliefs with full beliefs about objective chances.
A sophisticated view of partial beliefs recognizes that they are sui generis, which necessitates a substantial reconciliation project between full belief epistemology and partial belief epistemology. My task at hand is to agree with the refined position of Bool-B in their argument against Bool-A that this reconciliation project is indeed substantial and that partial beliefs are not full beliefs about objective chances; but also that Bool-B fails to summon arguments against the Laplacean position without relapsing into an unrefined version of indeterminacy.

Here, then, are Augustin’s concessions:

(AC1) Credal states do not adequately represent doxastic states. The same instate can reflect different doxastic states, even when the difference in the doxastic states matters for updating, inference, and decision making.

(AC2) Instates do not represent full belief claims about objective chances. White’s Chance Grounding Thesis is not an appropriate characterization of the Boolean position.

I agree with Joyce that (AC1) and (AC2) are both necessary and sufficient to resolve dilation and obtuse for instates. I disagree with Joyce about what this means for an overall recommendation to accept the Boolean rather than the Laplacean position. After I have already cast doubt on inform, I will show that (AC1) and (AC2) neutralize intern and incomp, the major impulses for rejecting the Laplacean position.

Indeterminacy imposes a double task on credences (representing both uncertainty and available evidence) that they cannot coherently fulfill. I will present several examples where this double task stretches instates to the limits of plausibility. Joyce’s idea that credences can represent balance, weight, and specificity of the evidence (in Joyce, 2005) is inconsistent with the use of indeterminacy. Joyce himself, in response to dilation and obtuse, gives the argument why this is the case (see Joyce, 2010, 290ff, for obtuse; and Joyce, 2010, 296ff, for dilation).
Let us begin by looking more closely at how (AC1) and (AC2) protect Bool-B from dilation and obtuse.

6.5.1 Augustin’s Concession (AC1)

(AC1) says that credences do not adequately represent a doxastic state. The same instate can reflect different doxastic states, where the difference is relevant to updating, inference, and decision making.

Augustin recognizes the problem of inadequate representation before Joyce, with specific reference to instates: “The imprecise posterior does no longer contain all the relevant information to produce optimal decisions. Inference and decision do not coincide any more” (Augustin, 2003, 41) (see also an example for inadequate representation of evidence by instates in Bradley and Steele, 2014, 1300). Joyce rejects the notion that identical instates encode identical beliefs by giving two examples. The first one is problematic. The second one, which is example 28 given earlier, addresses the issue of dilation more directly. Here is the first example.

Example 31: Three-Sided Die. Suppose C′ and C′′ are defined on a partition {X, Y, Z} corresponding to the result of a roll of a three-sided die. Let C′ contain all credence functions defined on {X, Y, Z} such that c(Z) ≥ 1/2, and let C′′ be the subset of C′ whose members also satisfy c(X) = c(Y) (see Joyce, 2010, 294).

Joyce then goes on to say,

It is easy to show that C′ and C′′ generate the same range of probabilities for all Boolean combinations of {X, Y, Z} . . . but they are surely different: the C′′-person believes everything the C′-person believes, but she also regards X and Y as equiprobable.

Example 31 is problematic because C′ and C′′ do not generate the same range of probabilities: if, as Joyce says, c(Z) ≥ 1/2, then c(X) = c(Y) implies c(X) ≤ 1/4 for C′′, but not for C′.
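The difference in ranges for example 31 can be confirmed by brute force. The following sketch is only an illustration: it replaces the continuum of credence functions with a grid of step 1/20 (an arbitrary choice of mine), builds both sets, and reports the spread of values each assigns to X.

```python
from fractions import Fraction

STEPS = 20  # grid resolution; an arbitrary choice for this illustration

def grid():
    """Credence functions (c(X), c(Y), c(Z)) whose values are multiples of 1/STEPS."""
    for i in range(STEPS + 1):
        for j in range(STEPS + 1 - i):
            yield (Fraction(i, STEPS), Fraction(j, STEPS),
                   Fraction(STEPS - i - j, STEPS))

# C': all credence functions with c(Z) >= 1/2.
c_prime = [c for c in grid() if c[2] >= Fraction(1, 2)]
# C'': the members of C' that also satisfy c(X) = c(Y).
c_double_prime = [c for c in c_prime if c[0] == c[1]]

def spread(instate, index):
    """Smallest and largest value the set assigns to one cell of the partition."""
    values = [c[index] for c in instate]
    return min(values), max(values)

range_x_prime = spread(c_prime, 0)
range_x_double_prime = spread(c_double_prime, 0)

print(range_x_prime)         # spread of c(X) over C'
print(range_x_double_prime)  # spread of c(X) over C''
```

On this grid the values for X run up to 1/2 in C′ but only up to 1/4 in C′′, while both sets assign Z the same spread from 1/2 to 1.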
What Joyce wants to say is that the same instate can encode doxastic states which are relevantly different when it comes to updating probabilities, and the best example for this is example 28 itself.

To explain this in more detail, I need to review for a moment what Lewis means by the Principal Principle and by inadmissibility. The Principal Principle requires that my knowledge of objective chances is reflected in my credence, unless there is inadmissible evidence. Inadmissible evidence would for instance be knowledge of a coin toss outcome, in which case of course I do not need to have a credence for it corresponding to the bias of the coin. In example 28, I could use the Principal Principle in the spirit of retention to derive a contradiction to the Boolean formalism.

\[ H_{iv} \equiv H_v \text{ does not give anything away about } H_{iv}, \tag{6.12} \]

therefore

\[ c(H_{iv} \mid H_{iv} \equiv H_v) = c(H_{iv}) \tag{6.13} \]

by the Principal Principle and in contradiction to (6.9).

Joyce explains how (6.12) is false and blocks the conclusion (6.13), which would undermine the Boolean position. H_iv ≡ H_v is clearly inadmissible, even without (AC1), since it is information that not only changes Blake’s doxastic state, but also her credal state.

We would usually expect more information to sharpen our credal states (see Walley’s anti-dilation principle and his response to this problem in 1991, 207 and 299), an intuition violated by both dilation and obtuse. As far as dilation is concerned, however, the loss of precision is in principle not any more surprising than information that increases the Shannon entropy of a sharp credence.

Example 32: Rumour. A rumour that the Canadian prime minister has been assassinated raises your initially very low probability that this event is taking place today to approximately 50%.

It is true for both sharp and indeterminate credences that information can make us less certain about things. This is the simple solution for retention.
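The way each committee member updates in example 28 can be traced in a few lines. This is a sketch under stated assumptions: eleven hypothetical members with sharp credences 0, 0.1, ..., 1 in H_v stand in for the instate [0, 1], coin_iv is fair, the tosses are independent, and the function name is my own. Each member conditions as in (6.10).

```python
from fractions import Fraction

HALF = Fraction(1, 2)  # every member's credence in H_iv before the report

def updated_credence_h_iv(q, tosses_agree):
    """Credence in H_iv after the report, for a member with credence q in H_v."""
    if tosses_agree:                       # learned H_iv ≡ H_v
        joint = HALF * q                   # P(H_iv and H_v)
    else:                                  # learned H_iv ≡ T_v
        joint = HALF * (1 - q)             # P(H_iv and T_v)
    evidence = HALF * q + HALF * (1 - q)   # probability of either report is 1/2
    return joint / evidence

# Hypothetical committee: sharp credences in H_v from 0 to 1 in steps of 0.1.
committee = [Fraction(k, 10) for k in range(11)]

after_agree = [updated_credence_h_iv(q, True) for q in committee]
after_differ = [updated_credence_h_iv(q, False) for q in committee]

print(min(after_agree), max(after_agree))    # spread after learning H_iv ≡ H_v
print(min(after_differ), max(after_differ))  # spread after learning H_iv ≡ T_v
```

Before the report every member holds 0.5 for H_iv. Afterwards each member's credence in H_iv equals her credence in H_v if the tosses agree and one minus that credence if they differ, so the committee as a whole spans the interval from 0 to 1 either way.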
If one of Joyce’s committee members has a sharp credence of 1 in H_v and learns H_iv ≡ H_v, then her sharp credence for H_iv should obviously be 1 as well; ditto and mutatis mutandis for the committee member who has a sharp credence of 0 in H_v. (AC1) is unnecessary.

Here is how dilation is as unproblematic as a gain in entropy after more information in example 32:

Example 33: Dilating Urns. You are about to draw a ball from an urn with 200 balls (100 red, 100 yellow). Just before you draw, you receive the information that the urn has two chambers which are obscured to you as you draw the ball, one with 99 red balls and 1 yellow ball, the other with 1 red ball and 99 yellow balls.

Dilation from a sharp credence of {0.5} to an instate of [0.01, 0.99] (or {0.01, 0.99}, depending on whether convexity is required) is unproblematic, although the example prefigures that there is something odd about the Boolean conceptual approach. The example licences a 99:1 bet for one of the colours (if the instate is interpreted as upper and lower previsions), which inductively would be a foolish move. This is a problem that arises out of the Boolean position quite apart from dilation and in conjunction with betting licences, which I will address in example 36.

So far I have not found convincing reasons to accept (AC1). Neither dilation as a phenomenon nor Lewis’ inadmissibility criterion need to compel a Bool-A advocate to admit (AC1), despite Joyce’s claims in the abstract of his paper that reactions to dilation “are based on an overly narrow conception of imprecise belief states which assumes that we know everything there is to know about a person’s doxastic attitudes once we have identified the spreads of values for her imprecise credences.” Not only does example 31 not give us the desired results, I can also resolve retention and repetition without recourse to (AC1). It is, in the final analysis, reflection
which will make the case for (AC1).

It is odd that Joyce explicitly says that the repetition argument “goes awry by assuming that credal states which assign the same range of credences for an event reflect the same opinions about that event” (Joyce, 2010, 304), when in the following he makes an air-tight case against the anti-Boolean force of repetition without ever referring to (AC1). Joyce’s case against repetition is based on the sleight of hand to which I made reference when I introduced repetition. There is no need to repeat Joyce’s argument here.

Let me add that despite my agreement with Joyce on how to handle repetition, although we disagree that it has anything to do with (AC1), a worry about the Boolean position with respect to repetition lingers. Joyce requires that Blake, in so far as Blake wants to be rational, has a maximally indeterminate credal state for H^10000_iv after hearing from her graduate student. The oddity of this, when we have just had about 5000 heads and 5000 tails, remains. Blake’s maximally indeterminate credal state rests on her stubborn conviction that her credal state must be inclusive of the objective chance of coin^10000_iv landing heads.

Logan, if she were to do this experiment, would relax, reason inductively as well as based on her prior probability, and give H_iv a credence of 0.5, since her credal state is not in the same straight-jacket of underlying objective chances, just as in example 26 Logan is able to have a sharp credence of 9/42 for picking an orange skittle, when her credence that there are 9 orange skittles in the bag is only 14.9%. Logan’s attitude may serve as additional motivation for Augustin’s second concession (AC2). A proponent of Bool-B would rather stand with Logan and hold that the relationship between instates and objective chances is much looser than for Bool-A, but more about this in subsection 6.5.2.

I am left with reflection, which bears most of the burden in Joyce’s argument
Augustin’s Concessionsfor (AC1). It is indeed odd that a rational agent should have two strictly distinctcredal states C1 and C2, when C2 follows upon C1 from a piece of information thatsays either X or qX—but it does not matter which of the two. Why does the rationalagent not assume C2 straight away? Joyce introduces an important distinction for vanFraassen’s reflection principle: It is not the degree of credence that is decisive as invan Fraassen’s original formulation of the principle, but the doxastic state. X and qX(in example 28, they refer to Hiv ≡ Hv and Hiv ≡ Tv) both lead from a sharp credenceof 0.5 to an instate of [0, 1] for Hiv, but this updated instate reflects different doxasticstates. In Joyce’s words, “the beliefs about Hiv you will come to have upon learningHiv ≡ Hv are complementary to the beliefs you will have upon learning Hiv ≡ Tv”(Joyce, 2010, 304). Committee members representing complementary beliefs agreeon the instate, but single committee members take opposite views depending on theinformation, X or qX (besides Joyce, see also Topey, 2012, 483). Credal states keeptrack only of the committee’s aggregate credal state, whereas doxastic states keeptrack of each committee member’s individual sharp credences. For Joyce, this isa purely formal distinction, not to be confused with questions about the believer’spsychology (see Joyce, 2010, 288).This resolves the anti-Boolean force of reflection by making the concession(AC1). Ironically, of course, not being able to represent a doxastic state, but onlyto reflect it inadequately, was just the problem that Laplacean sharp credences hadwhich Boolean instates were supposed to fix. 
One way Bool-B could respond is like this:

Riposte: If you are going to make a difference between representing a doxastic state and reflecting a doxastic state, where the former poses an identity relationship between credal states and doxastic states and the latter a supervenience relationship, then all you have succeeded in making me concede is that both instates and sharp credences reflect doxastic states. Instates may still be more successful in reflecting doxastic states than sharp credences are and so solve the abductive task of explaining partial beliefs better.

There are several reasons why I doubt this line of argument is successful. The Laplacean position has formal advantages, for example that it is able to cooperate with information theory, an ability which Bool-B lacks. There is no coherent theory of how relations between instates can be evaluated on the basis of information and entropy, which are powerful tools when it comes to justifying fundamental Bayesian tenets (see Shore and Johnson, 1980; Giffin, 2008). Furthermore, the Laplacean position is conceptually tidy. It distinguishes between the quantifiable aspects of a doxastic state, which it integrates to yield the sharp credence, and other aspects of the evidence, such as incompleteness or ambiguity. Instates dabble in what I will call the double task: trying to reflect both aspects without convincing success in either.

6.5.2 Augustin’s Concession (AC2)

(AC2) says that instates do not reflect knowledge claims about objective chances. White’s Chance Grounding Thesis (which White does not endorse, being a Laplacean) is not an appropriate characterization of the Boolean position.

Chance Grounding Thesis (CGT): Only on the basis of known chances can one legitimately have sharp credences. Otherwise one’s spread of credence should cover the range of possible chance hypotheses left open by your evidence.
(White, 2010, 174)

Joyce considers (AC2) to be as necessary for a coherent Boolean view of partial beliefs, blocking obtuse, as (AC1) is, blocking dilation (see Joyce, 2010, 289f). obtuse is related to vacuity, another problem for Booleans:

• vacuity If one were to be committed to the principle of regularity, that all states of the world considered possible have positive probability (for a defence see Edwards et al., 1963, 211); and to the solution of Henry Kyburg’s lottery paradox, that what is rationally accepted should have probability 1 (for a defence of this principle see Douven and Williamson, 2006); and the CGT, that one’s spread of credence should cover the range of possible chance hypotheses left open by the evidence (implied by much of Boolean literature); then one’s instate would always be vacuous.

Booleans must deny at least one of the premises to avoid the conclusion. Joyce denies the CGT, giving us (AC2). It is by no means necessary to sign on to regularity and to the above-mentioned solution of Henry Kyburg’s lottery paradox in order to see how (AC2) is a necessary refinement of Bool-A. The link between objective chances and credal states expressed in the CGT is suspect for many other reasons. I have referred to them passim, but will not go into more detail here.

6.6 The Double Task

Sharp credences have a single task: to reflect epistemic uncertainty as a tool for updating, inference, and decision making. They cannot fulfill this task without continued reference to the evidence which operates in the background. To use an analogy, credences are not sufficient statistics with respect to updating, inference, and decision making. What is remarkable about Joyce’s response to dilation and obtuse is that Joyce recognizes that instates are not sufficient statistics either.
But this means that they fail at the double task which has been imposed on them: to represent both epistemic uncertainty and relevant features of the evidence.

In the following, I will provide a few examples where it becomes clear that instates have difficulty representing uncertainty because they are tangled in a double task which they cannot fulfill.

Example 34: Aggregating Expert Opinion. Blake has no information whether it will rain tomorrow (R) or not except the predictions of two weather forecasters. One of them forecasts 0.3 on channel GPY, the other 0.6 on channel QCT. Blake considers the QCT forecaster to be significantly more reliable, based on past experience.

An instate corresponding to this situation may be [0.3, 0.6] (see Walley, 1991, 214), but it will have a difficult time representing the difference in reliability of the experts. We could try [0.2, 0.8] (since the greater reliability of QCT suggests that the chance of rain tomorrow is higher rather than lower) or [0.1, 0.7] (since the greater reliability of QCT suggests that its estimate is more precise), but it remains obscure what the criteria are.

A sharp credence of P(R) = 0.53, for example, does the right thing. Such a credence says nothing about any beliefs that the objective chance is restricted to a subset of the unit interval, but it accurately reflects the degree of uncertainty that the rational agent has over the various possibilities. Beliefs about objective chances make little sense in many situations where we have credences, since it is doubtful even in the case of rain tomorrow that there is an urn of nature from which balls are drawn.
What is really at play is a complex interaction between epistemic states (for example, experts evaluating meteorological data) and the evidence which influences them.

As we will see in the next example, it is an advantage of sharp credences that they do not exclude objective chances, even extreme ones, because they express partial belief and do not suggest, as instates do for Bool-A, that there is full belief knowledge that the objective chance is a member of a proper subset of the possibilities (for an example of a crude version of indeterminacy that reduces partial beliefs to full beliefs see Levi, 1981, 540, “inference derives credal probability from knowledge of the chances of possible outcomes”; or Kaplan, 2010, 45, “you should rule out all and only the assignments the evidence warrants your regarding as too high or too low, and you should remain in suspense between those degree of confidence assignments that are left”).

Example 35: Precise Credences. Logan’s credence for rain tomorrow, based on the expert opinion of channel GPY and channel QCT (she has no other information) is 0.53. Is it reasonable for Logan, considering how little evidence she has, to reject the belief that the chance of rain tomorrow is 0.52 or 0.54; or to prefer a 52.9 cent bet on rain to a 47.1 cent bet on no rain?

The first question in example 35 is as confused as the Bool-A confusion found in example 27 discussed earlier. As for the second question in example 35: why would we prefer a 52.9 cent bet on rain to a 47.1 cent bet on no rain, given that we do not possess the power of discrimination between these two bets? If X and Y are two propositions, then it is important not to confuse the claim that it is reasonable to hold both X and Y with the claim that it is reasonable to hold either X (without Y) or Y (without X).
It is the reasonableness of holding X and Y concurrently that is controversial, not the reasonableness of holding Y (without holding X) when it is reasonable to hold X.

Let U(S, Z, t) mean “it is rational for S to believe Z at time t.” Then U is exportable (see Rinard, 2015, 6) if and only if U(S, X, t) and U(S, Y, t) imply U(S, X ∧ Y, t). Beliefs somehow grounded in subjectivity, such as beliefs about etiquette or colour perception, are counter-examples for the exportability of U. Vagueness also gives us cases of non-exportability. Rinard considers the connection between vagueness and indeterminacy to be an argument in favour of indeterminacy.

My argument is that non-exportability blunts an argument against the Laplacean position. In a moment, I will talk about anti-luminosity, the fact that a rational agent may not be able to distinguish psychologically between a 54.9 cent bet on an event and a 45.1 cent bet on its negation, when her sharp credence is 0.55. She must reject one of them not to incur sure loss, so proponents of indeterminacy suggest that she choose one of them freely without being constrained by her credal state or reject both of them. I claim that a sharp credence will make a recommendation between the two so that only one of the bets is rational given her particular credence, but that does not mean that another sharp credence which would give a different recommendation may not also be rational for her to have. Partial beliefs are non-exportable.

The answer to the second question in example 35 ties in with the issue of incomplete preference structure referred to above as motivation (B) for instates (see page 136).

It hardly seems a requirement of rationality that belief be precise (and preferences complete); surely imprecise belief (and corresponding incomplete preferences) are at least rationally permissible.
(Bradley and Steele, 2014, 1288, for a similar sentiment see Kaplan, 2010, 44.)

The development of representation theorems beginning with Frank Ramsey (followed by increasingly more compelling representation theorems in Savage, 1954; and Jeffrey, 1965; and numerous other variants in contemporary literature) bases probability and utility functions of an agent on her preferences, not the other way around. Once completeness as an axiom for the preferences of an agent is jettisoned, indeterminacy follows automatically. Indeterminacy may thus be a natural consequence of the proper way to think about credences in terms of the preferences that they represent.

In response, preferences may very well logically and psychologically precede an agent’s probability and utility functions, but that does not mean that we cannot inform the axioms we use for a rational agent’s preferences by undesirable consequences downstream. Completeness may sound like an unreasonable imposition at the outset, but if incompleteness has unwelcome consequences for credences downstream, it is not illegitimate to revisit the issue.

Timothy Williamson goes through this exercise with vague concepts, showing that all upstream logical solutions to the problem fail and that it has to be solved downstream with an epistemic solution (see Williamson, 1996). Vague concepts, like sharp credences, are sharply bounded, but not in a way that is luminous to the agent (for anti-luminosity see chapter 4 in Williamson, 2000). Anti-luminosity answers the second question in example 35: the rational agent prefers the 52.9 cent bet on rain to a 47.1 cent bet on no rain based on her sharp credence without being in a position to have this preference necessarily or have it based on physical or psychological ability (for the analogous claim about knowledge see Williamson, 2000, 95).

In a way, advocates of indeterminacy have solved this problem for us.
There is strong agreement among most of them that the issue of determinacy for credences is not an issue of elicitation (sometimes the term ‘indeterminacy’ is used instead of ‘imprecision’ to underline this difference; see Levi, 1985, 395). The appeal of preferences is that we can elicit them more easily than assessments of probability and utility functions. The indeterminacy issue has been raised to the probability level (or moved downstream) by indeterminacy advocates themselves who feel justifiably uncomfortable with an interpretation of their theory in behaviourist terms. So it shall be solved there, and this chapter makes an appeal to reject indeterminacy on this level. The solution then has to be carried upstream (or lowered to the logically more basic level of preferences), where we recognize that completeness for preferences is after all a desirable axiom for rationality and “perfectly rational agents always have perfectly sharp probabilities” (Elga, 2010, 1). When Levi talks about indeterminacy, it also proceeds from the level of probability judgment to preferences, not the other way around (see Levi, 1981, 533).

Note also that Booleans customarily talk about imprecise probabilities, but seldom about imprecise utility. Bradley and Steele “do not anticipate any problems in extending the results to the case of both imprecise probabilities and utilities” (Bradley and Steele, 2016, 9), but without having investigated this any further I suspect that there are counterintuitive implications downstream for imprecise utility as well.

Example 36: Monkey-Filled Urns. Let urn A contain 4 balls, two red and two yellow. A monkey randomly fills urn B from urn A with two balls. We draw from urn B (a precursor to this example is in Jaynes and Bretthorst, 2003, 160).

One reasonable sharp credence of drawing a red ball is 0.5, following Lewis’ summation formula for the different combinations of balls in urn B and symmetry considerations.
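The summation over the possible contents of urn B in example 36 can be checked by brute force. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that the monkey moves two balls uniformly at random and without replacement, which the example does not spell out. It weights the objective chance of red under each possible filling by the credence in that filling, and separately lists the chance hypotheses that an instate would have to span.

```python
from fractions import Fraction
from itertools import combinations

# Urn A as described in example 36: two red and two yellow balls.
URN_A = ["red", "red", "yellow", "yellow"]


def credence_red(urn_a, n_moved=2):
    """Expected chance of drawing red from urn B.

    Assumption (not stated in the example): the monkey moves n_moved
    balls uniformly at random without replacement, so every subset of
    that size is an equally likely filling of urn B.
    """
    fillings = list(combinations(range(len(urn_a)), n_moved))
    weight = Fraction(1, len(fillings))
    total = Fraction(0)
    for filling in fillings:
        reds = sum(1 for i in filling if urn_a[i] == "red")
        chance_red = Fraction(reds, n_moved)  # objective chance under this filling
        total += weight * chance_red          # weighted by credence in the filling
    return total


def possible_chances(urn_a, n_moved=2):
    """Distinct chance hypotheses for red left open by the setup."""
    chances = set()
    for filling in combinations(range(len(urn_a)), n_moved):
        reds = sum(1 for i in filling if urn_a[i] == "red")
        chances.add(Fraction(reds, n_moved))
    return sorted(chances)


print(credence_red(URN_A))      # 1/2
print(possible_chances(URN_A))  # [Fraction(0, 1), Fraction(1, 2), Fraction(1, 1)]
```

Under these assumptions the six equally likely fillings give one all-red urn, four mixed urns, and one all-yellow urn, so the weighted sum is 1/2 while the set of chance hypotheses is {0, 1/2, 1}.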
This solution is more intuitive in terms of further inference, decision making, and betting behaviour than a credal state of {0, 1/2, 1} or [0, 1] (depending on the convexity requirement), since this instate would licence an exorbitant bet in favour of one colour, for example one that costs $9,999 and pays $10,000 if red is drawn and nothing if yellow is drawn.

How a bet is licenced is different on various Boolean accounts. Rinard, for example, contrasts a moderate account with a liberal account (see Rinard, 2015, 7). According to the liberal account, the $9,999 bet is licenced, whereas according to the moderate account, it is only indeterminate whether the bet is licenced. The moderate account does not take away from the force of example 36, where it should be determinate that a $9,999 bet is not licenced.

To make example 36 more vivid consider a Hand Urn, where you draw by hand from an urn with 100 balls, 50 red balls and 50 yellow balls. When your hand retreats from the urn, does it not contain either a red ball or a yellow ball and so serve itself as an urn, from which in a sense you draw a ball? Your hand contains one ball, either red or yellow, and the indeterminate credal state that it is one or the other should be [0, 1]. This contradicts our intuition that our credence should be a sharp 0.5 and raises again the mode of representation issue, as in retention. Sharp credences are more resistant to partition variance when the partition is along lines that appear irrelevant to the question (such as symbols on pieces of chocolate as in example 29 or how a ball is retrieved from an urn as in example 36).

Example 37: Three Prisoners. Prisoner X1 knows that two out of three prisoners (X1, X2, X3) will be executed and one of them pardoned. She asks the warden of the prison to tell her the name of another prisoner who will be executed, hoping to gain knowledge about her own fate.
When the warden tells her that X3 will be executed, X1 erroneously updates her probability of pardon from 1/3 to 1/2, since either X1 or X2 will be spared.

Walley maintains that for the Monty Hall problem and the Three Prisoners problem, the probabilities of a rational agent should dilate rather than settle on the commonly accepted solutions. For the Three Prisoners problem, there is a compelling case for standard conditioning and the result that the credence for prisoner X1 to receive a pardon ought not to change after the update (see section 7.4). Walley’s dilated solution would give prisoner X1 hope on the doubtful possibility (and unfounded assumption) that the warden might prefer to provide X3’s (rather than X2’s) name in case prisoner X1 was pardoned.

This example brings an interesting issue to the forefront. Sharp credences often reflect independence of variables where such independence is unwarranted. Booleans (more specifically, detractors of the principle of indifference or the principle of maximum entropy, principles which are used to generate sharp credences for rational agents) tend to point this out gleefully. They prefer to dilate over the possible dependence relationships, independence included. Dilation in example 28 is an instance of this. The fallacy in the argument for instates, illustrated by the Three Prisoners problem, is that the probabilistic independence of sharp credences does not imply independence of variables. Only the converse is correct. Probabilistic independence may simply reflect an averaging process over various dependence relationships, independence again included.

In the Three Prisoners problem, there is no evidence about the degree or the direction of the dependence, and so prisoner X1 should take no comfort in the information that she receives. The rational prisoner’s probabilities will reflect probabilistic independence, but make no claims about causal independence.
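The contrast between standard conditioning and dilation in example 37 can be made concrete with a small calculation. The Python sketch below is illustrative only; the parameter s, the warden's probability of naming X3 rather than X2 when X1 is the one pardoned, is a hypothetical device introduced here to index the possible dependence relationships. Setting s = 1/2 is an assumption that stands in for averaging symmetrically over the warden's unknown preference.

```python
from fractions import Fraction


def pardon_posterior(s):
    """Posterior probability that X1 is pardoned, given the warden names X3.

    s is a hypothetical parameter: the probability that the warden names
    X3 (rather than X2) in the case where X1 is the prisoner pardoned.
    """
    prior = Fraction(1, 3)        # each prisoner equally likely to be pardoned
    says_x3_if_x1 = Fraction(s)   # warden is free to name X2 or X3
    says_x3_if_x2 = Fraction(1)   # warden must name X3
    says_x3_if_x3 = Fraction(0)   # warden cannot name the pardoned prisoner
    evidence = prior * (says_x3_if_x1 + says_x3_if_x2 + says_x3_if_x3)
    return prior * says_x3_if_x1 / evidence


# Sharp credence: averaging over the warden's preference (s = 1/2)
# leaves the probability of pardon unchanged at 1/3.
print(pardon_posterior(Fraction(1, 2)))   # 1/3

# Dilation: letting s range over [0, 1] spreads the posterior over [0, 1/2].
print(pardon_posterior(0), pardon_posterior(1))   # 0 1/2
```

The posterior works out to s/(1 + s), so the unchanged credence of 1/3 corresponds to the single value s = 1/2, whereas refusing to settle on any value of s yields the dilated interval from 0 to 1/2.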
Walley has unkind things to say about sharp credences and their ability to respond to evidence (for example that their “inferences rarely conform to evidence”, see Walley, 1991, 396), but in this case it appears to me that they outperform the Boolean approach.

Joyce also commits himself to dilation for the Three Prisoners problem and would thus have to call the conventional solution defective epistemology (see Joyce, 2010, 292), for in the comparable case of example 28 he states that “behaving as if these two events are independent amounts to pulling a statistical correlation out of thin air!” (Joyce, 2010, 300.) The two events in question are Hiv and Hiv ≡ Hv in example 28, but they could just as well be ‘X1 will be executed’ and ‘warden says X3 will be executed’ in example 37.

To conclude, a Boolean in the light of Joyce’s two Augustinian concessions has three alternatives, of which I favour the third: (a) to find fault with Joyce’s reasoning as he makes those concessions; (b) to think (as Joyce presumably does) that the concessions are compatible with the promises of Booleans, such as intern, incomp, and inform, to solve prima facie problems of sharp credences; or (c) to abandon the Boolean position because (AC1), (AC2), and an array of examples in which sharp credences are conceptually and pragmatically more appealing show that the initial promise of the Boolean position is not fulfilled.

Chapter 7

Judy Benjamin

7.1 Goldie Hawn and a Test Case for Full Employment

Probability kinematics is the field of inquiry asking how we should update a probability distribution in the light of evidence. If the evidence comes as an event, it is relatively uncontroversial to use conditional probabilities (standard conditioning). Sometimes, however, the evidence may not relate the certainty of an event but a reassessment of its uncertainty or its probabilistic relation to other events (see Jeffrey, 1965, 153ff), expressible in a shift in expectation (see Hobson, 1971).
Jeffrey conditionalization can deal with some of these cases, but not with all of them (see figure 7.1). Bas van Fraassen has come up with an example for a case in which we cannot apply Jeffrey conditionalization. The example is from the 1980 comedy film Private Benjamin (see van Fraassen, 1981), in which Goldie Hawn portrays a Jewish-American woman (Judy Benjamin) who joins the U.S. Army.

Example 38: Judy Benjamin. In van Fraassen’s interpretation of the movie, Judy Benjamin is on an assignment and lands in a place where she is not sure of her location. She is on team Blue. Because of the map, initially the probability of being in Blue territory equals the probability of being in Red territory, and the probability of being in the Red Second Company area equals the probability of being in the Red Headquarters area. Her commanders then inform Judy by radio that in case she is in Red territory, her chance of being in the Red Headquarters area is three times the chance of being in the Red Second Company area.

Figure 7.1: Information that leads to unique solutions for probability updating using pme must come in the form of an affine constraint (a constraint in the form of an informationally closed and convex space of probability distributions consistent with the information). All information that can be processed by Jeffrey conditioning comes in the form of an affine constraint, and all information that can be processed by standard conditioning can also be processed by Jeffrey conditioning. The solutions of pme are consistent with the solutions of Jeffrey conditioning and standard conditioning where the latter two are applicable.

The question is what Judy’s appropriate response is to this new evidence. We cannot apply standard conditioning, because there is no immediately obvious event space in which we can condition on an event of which we are certain.
Grove and Halpern (1997) have offered a proposal for constructing such event spaces and then conditioning on the event that Judy Benjamin receives the information that she receives from her commanders. They admit, however, that the construction of such spaces (sometimes called retrospective conditioning) is an exercise in filling in missing details and supplying information not contained in the original problem.

Figure 7.2: Judy Benjamin’s map, with the areas A1 (Red Second Company), A2 (Red Headquarters), and A3 (Blue). Blue territory (A3) is friendly and does not need to be divided into a Headquarters and a Second Company area.

If we assume that the attempt fails to define an event on which Judy Benjamin could condition her probabilities, we are left with two possibilities. Her new information (it is three times as likely to land in A2 as to land in A1, see figure 7.2 and the details of the problem in the next section) may mean that we have a redistribution of a complete partition of the probabilities. This is called Jeffrey conditioning and calls for Jeffrey’s rule. Jeffrey’s rule is contested in some circles, but we will for this project accept its validity in probability kinematics. We will see in what follows that some make the case that Jeffrey conditioning is the correct way to solve the Judy Benjamin problem. For reasons provided in the body of the chapter their case is implausible.

The third possibility to solve this problem (after standard conditioning and Jeffrey conditioning) is to consult a highly contested updating procedure: the principle of maximum entropy (pme for short). pme can be applied to any situation in which we have a completely quantified probability distribution and an affine constraint (I will explain the nature of affine constraints in more detail later). If my new evidence is the observation of an event (or simply certainty about an event that we did not have previously), then the event provides an affine constraint and can be used for updating by means of standard conditioning. If my new evidence is a redistribution of probabilities where I can apply Jeffrey’s rule, then the redistribution provides an affine constraint and can be used for updating by means of Jeffrey’s rule. These two possibilities, however, do not exhaust affine constraints. The Judy Benjamin problem illustrates the third possibility where the affine constraint only redistributes some groups of probabilities and leaves open the question how this will affect the probabilities not included in this redistribution.

Advocates of pme claim that in this case the probabilities should be adjusted so that they are minimally affected (we make this precise by using information theory) while at the same time according with the constraint. Opponents of this view grant that pme is an important tool of probability kinematics. Noting that some results of pme are difficult to accept (such as in the Judy Benjamin case), however, they urge us to embrace a more pluralistic, situation-specific methodology.

Joseph Halpern, for example, writes in Reasoning About Uncertainty that “there is no escaping the need to understand the details of the application” (Halpern, 2003, 423) and concludes that pme is a valuable tool, but should be used with care (see Grove and Halpern, 1997, 110), explicitly basing his remark on the counterintuitive behaviour of the Judy Benjamin problem. Diaconis and Zabell state: “any claims to the effect that maximum-entropy revision is the only correct route to probability revision should be viewed with considerable caution” (Diaconis and Zabell, 1982, 829).
“Great caution” (1994, 456) is also what Colin Howson and Allan Franklin advise about the more basic claim that the updated probabilities provided by pme are as like the original probabilities as it is possible to be given the constraints imposed by the data.

In the same vein, Igor Douven and Jan-Willem Romeijn agree with Richard Bradley that “even Bayes’ rule ‘should not be thought of as a universal and mechanical rule of updating, but as a technique to be applied in the right circumstances, as a tool in what Jeffrey terms the art of judgment.’ In the same way, determining and adapting the weights . . . may be an art, or a skill, rather than a matter of calculation or derivation from more fundamental epistemic principles” (Douven and Romeijn, 2009, 16) (for the Bradley quote see Bradley, 2005, 362).

What is lacking in the literature is a response by pme advocates to the counterintuitive behaviour of the cases repeatedly quoted by their adversaries. This is especially surprising as we are not dealing with an array of counter-examples but only a handful, the Judy Benjamin problem being prime among them. In Halpern’s textbook, for example, the reasoning is as follows: pme is a promising candidate which delivers unique updated probability distributions; but, unfortunately, there is counterintuitive behaviour in one specific case, the Judy Benjamin case (see Halpern, 2003, 110, 119); therefore, we must abide by the eclectic principle of considering not only pme, but also lower and upper probabilities, Dempster-Shafer belief functions, possibility measures, ranking functions, relative likelihoods, and so forth. The human inquirer is the final arbiter between these conditionalization methods.

At the heart of my investigation are two incompatible but independently plausible intuitions regarding Judy’s choice of updated probabilities for her location.
I will undermine the notion that pme’s solution for the Judy Benjamin problem is counterintuitive. The intuition that pme’s solution for the Judy Benjamin problem violates (call it T1) is based on fallacious independence and uniformity assumptions. There is another powerful intuition (call it T2) that conflicts with T1 and obeys pme. Therefore, Halpern does not give us sufficient grounds for the eclecticism advocated throughout his book. I will show that another intuitive approach, the powerset approach, lends significant support to the solution provided by pme for the Judy Benjamin problem, especially in comparison to intuition T1, many of whose independence and uniformity assumptions it shares.

I have no proof that pme is the only rationally defensible objective method to update probabilities given an affine constraint. The literature outlines many of the ‘nice properties’ of pme. It seamlessly generalizes standard conditioning and Jeffrey’s rule where they are applicable (see Caticha and Giffin, 2006). It underlies the entropy concentration phenomenon described in Jaynes’ standard work Probability Theory: the Logic of Science, which contains other arguments in favour of pme (some of which you may recognize by family resemblance in the rest of this chapter). Entropy concentration refers to the unique property of the pme solution to have other distributions which obey the affine constraint cluster around it. Shore and Johnson have shown that under certain rationality assumptions pme provides the unique solution to problems of probability updating (see Shore and Johnson, 1980). When used to make predictions whose quality is measured by a logarithmic score function, posterior probabilities provided by pme result in minimax optimal decisions (see Topsøe, 1979, Walley, 1991, and Grünwald, 2000).
Under a logarithmic scoring rule these posterior probabilities are in some sense optimal.

Despite all these nice properties, I want the reader to follow me in a simpler line of argument. When new evidence is provided to us, it is rational to adjust our beliefs minimally in light of it. I do not want to draw out more information from the new evidence than necessary. I am the first to admit that there are numerous problems here that need addressing. What do I mean by rationality? What are the semantics of the word ‘minimal’? What are the formal properties of such posterior probabilities? Are they unique? Are they compatible with other intuitive methods of updating? Are there counter-intuitive examples that would encourage us to give up on this line of thought rather than live with its consequences? Given some decent answers to these questions, however, I feel that pme cuts a good figure as a first pass to provide objective solutions to these types of problems, and the burden on opponents who usually deny that such objective solutions exist grows heavy.

The distinctive contribution of this chapter is to show why the reasoning of the opponents of pme in the Judy Benjamin case is flawed. They make independence assumptions that on closer inspection do not hold up. I will provide a number of scenarios consistent with the information in the problem which violate these independence assumptions. That does not mean that the information given in the problem suggests these scenarios, it only means that we are not entitled to make those independence assumptions.
That, in turn, does not privilege the pme solution, although pme does not lean on independence assumptions that other solutions illegitimately make. pme, however, confronts us with a much stronger claim than merely providing a passable or useful solution to the Judy Benjamin problem: it lays claim to much greater generality and, to use a term abjured by many formal epistemologists, to objectivity. These claims must be motivated elsewhere, and the nature of their normativity is a matter of debate (for a pragmatic approach see Caticha, 2012). I am only showing that opponents cannot claim an easy victory by pulling out old Judy Benjamin.

There is a long-standing disagreement between (mostly) philosophers on the one hand and (mostly) physicists on the other hand. The philosophers claim that updating probabilities is irreducibly accompanied by thoughtful deliberation with the choice between different updating procedures depending on individual problems. The physicists claim that problems are ill-posed if they do not contain the information necessary to let a non-arbitrary, objective procedure (such as pme) arrive at a unique updated probability distribution. In the literature, Judy Benjamin serves as an example widely taken to count in favour of the philosophers. It is taken to support what I call the full employment theorem of probability kinematics.

7.2 Two Intuitions

There are two pieces of information relevant to Judy Benjamin when she decides on her updated probability assignment. I will call them (MAP) and (HDQ). As in figure 7.2, A1 is the Red Second Company area, A2 is the Red Headquarters area, A3 is Blue territory. Judy presumably wants to be in Blue territory, but if she is in Red territory, she would prefer their Second Company area (where enemy soldiers are not as well-trained as in the Headquarters area).

(MAP) Judy has no idea where she is.
Because of the map, her probability of being in Blue territory equals the probability of being in Red territory, and the probability of being in the Red Second Company area equals the probability of being in the Red Headquarters area.

(HDQ) Her commanders inform Judy that in case she is in Red territory, her chance of being in their Headquarters area is three times the chance of being in their Second Company area.

In formal terms (sloppily writing Ai for the event of Judy being in Ai),

$$2 \cdot P(A_1) = 2 \cdot P(A_2) = P(A_3) \tag{MAP}$$

$$\vartheta = P(A_2 \mid A_1 \cup A_2) = \frac{3}{4} \tag{HDQ}$$

(HDQ) is partial information: it differs both from the kind of evidence we are used to in Bayes’ formula (such as ‘an even number was rolled’) and from the kind of evidence needed for Jeffrey’s rule (where a partition of the whole event space and its probability redistribution is required, not only A1 ∪ A2, but see here the objections in Douven and Romeijn, 2009), so the scenario suggests that Bayesian conditionalization and Jeffrey’s rule are inapplicable. I am interested in the most defensible updated probability assignment(s) and will express them in the form of a normalized odds vector (q1, q2, q3), following van Fraassen (1981). qi is the updated probability Q(Ai) that Judy Benjamin is in Ai. Let P be the probability distribution prior to the new observation and pi the individual relatively prior probabilities. The qi sum to 1 (this differs from van Fraassen’s canonical odds vector, which is proportional to the normalized odds vector but has 1 as its first element). We define

$$t = \frac{\vartheta}{1 - \vartheta} \tag{7.1}$$

t is the factor by which (HDQ) indicates that Judy’s chance of being in A2 is greater than being in A1. In Judy’s particular case, t = 3 and ϑ = 0.75.
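The formal statement can be checked with a few lines of exact arithmetic. The following is only a sketch: the function names `prior_from_map`, `odds_factor`, and `satisfies_hdq` are illustrative and do not come from the text. It derives the prior from (MAP) plus normalization, converts ϑ into t via (7.1), and confirms that the prior itself does not yet satisfy (HDQ).

```python
from fractions import Fraction


def prior_from_map():
    """Return (p1, p2, p3) satisfying (MAP) and p1 + p2 + p3 = 1."""
    # (MAP): 2*p1 = 2*p2 = p3, so p2 = p1 and p3 = 2*p1; normalization gives 4*p1 = 1.
    p1 = Fraction(1, 4)
    return (p1, p1, 2 * p1)


def odds_factor(theta):
    """Equation (7.1): t = theta / (1 - theta)."""
    theta = Fraction(theta)
    return theta / (1 - theta)


def satisfies_hdq(q, theta):
    """Check (HDQ): Q(A2 | A1 or A2) equals theta."""
    q1, q2, _ = q
    return q2 / (q1 + q2) == Fraction(theta)


if __name__ == "__main__":
    p = prior_from_map()
    print(p)                                 # the prior implied by (MAP)
    print(odds_factor(Fraction(3, 4)))       # 3
    print(satisfies_hdq(p, Fraction(3, 4)))  # False: the prior gives 1/2, not 3/4
```

Any candidate updated vector (q1, q2, q3) can be passed to `satisfies_hdq` to see whether it respects the commanders' message.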
Two intuitions guide the way people think about Judy Benjamin’s situation.

T1 (HDQ) does not refer to Blue territory and should not affect P(A3): q3 = p3 (= 0.50).

There is another, conflicting intuition (due to Peter Williams via personal communication with van Fraassen, see van Fraassen, 1981, 379):

T2 If the value of ϑ approaches 1 (in other words, t approaches infinity) then q3 should approach 2/3 as the problem reduces to one of ordinary conditioning. (HDQ) would turn into ‘if you are in Red territory you are almost certainly in the Red Headquarters area.’ Considering (MAP), q3 should approach 2/3.

Continuity considerations pose a contradiction to T1. (These considerations are strong enough that Luc Bovens uses them as an assumption to solve Adam Elga’s Sleeping Beauty problem by parity of reasoning in Bovens, 2010.) To parse these conflicting intuitions, I will introduce several methods to provide G, the function that maps ϑ to the appropriate normalized updated odds vector (q1, q2, q3).

The first method is extremely simple and accords with intuition T1: G_ind(ϑ) = (0.5(1 − ϑ), 0.5ϑ, 0.5). In Judy’s particular case with t = 3 the normalized odds vector is (ind stands for independent):

$$G_{\mathrm{ind}}(0.75) = (0.125, 0.375, 0.500) \tag{7.2}$$

Both Grove and Halpern (1997) and Douven and Romeijn (2009) make a case for this distribution. Grove and Halpern use standard conditioning on the event of the message being transmitted to Judy. Douven and Romeijn use Jeffrey’s rule (because they believe that T1 is in this case so strong that Q(A3) = P(A3) is as much of a constraint as (MAP) and (HDQ), yielding a Jeffrey partition).

T1, however, conflicts with the symmetry requirements outlined in van Fraassen et al. (1986). Van Fraassen introduces various updating methods which do not conflict with those symmetry requirements, the most notable of which is pme. Shore and Johnson have already shown that, given certain assumptions (which have been heavily criticized, e.g.
in Uffink, 1996), pme produces the unique updated probability assignment according with these assumptions. The minimum information discrimination theorem of Kullback and Leibler (see for example Csiszár, 1967, section 3) demonstrates how Shannon’s entropy and the Kullback-Leibler Divergence formula can provide the least informative updated probability assignment (with reference to the prior probability assignment) obeying the constraint posed by the evidence. The idea is to define a space of probability distributions, make sure that the constraint identifies a closed, convex subset in this space, and then determine which of the distributions in the closed, convex subset is least distant from the prior probability distribution in terms of information (using the minimum information discrimination theorem). It is necessary for the uniqueness of this least distant distribution that the subset be closed and convex (in other words, that the constraint be affine, see Csiszár, 1967).

For Judy Benjamin, pme suggests the following normalized odds vector:

$$G_{\mathrm{max}}(0.75) \approx (0.117, 0.350, 0.533) \tag{7.3}$$

The updated probability of being on Blue territory (A3) has increased from 50% to approximately 53%. Grove and Halpern find this result “highly counterintuitive” (Grove and Halpern, 1997, 2). Van Fraassen summarizes the worry:

It is hard not to speculate that the dangerous implications of being in the enemy’s Headquarters area are causing Judy Benjamin to indulge in wishful thinking, her indulgence becoming stronger as her conditional estimate of the danger increases. (van Fraassen, 1981, 379.)

There are two ways in which I can arrive at result (7.3). I may use the constraint rule for cross-entropy derived in subsection 3.2.3 or use the Kullback-Leibler Divergence and differentiate it to find where it is minimal. The constraint rule has the advantage of providing results when the derivative of the Kullback-Leibler Divergence is difficult to find.
This not being the case for Judy, I go the easier route of the second method here and provide a more general justification for the constraint rule in subsection 3.2.3, together with its application to the Judy Benjamin case.

The Kullback-Leibler Divergence is

$$D(Q, P) = \sum_{i=1}^{m} q_i \log_2 \frac{q_i}{p_i}. \tag{7.4}$$

I fill in the explicit details from Judy Benjamin’s situation and differentiate the expression to obtain the minimum (by setting the derivative to 0).

$$\frac{\partial}{\partial q_1}\Big( q_1 \log_2 (4 q_1) + t q_1 \log_2 (4 t q_1) + (1 - (t+1) q_1) \log_2 2(1 - (t+1) q_1) \Big) = 0 \tag{7.5}$$

The resulting expression for G_max is

$$G_{\mathrm{max}}(\vartheta) = \left( \frac{1}{1+t+C}, \frac{t}{1+t+C}, \frac{C}{1+t+C} \right), \text{ with } C = 2 t^{\frac{t}{1+t}}. \tag{7.6}$$

Figures 7.3 and 7.4 show in diagram form the distribution of (q1, q2, q3) depending on the value of ϑ (between 0 and 1), respectively following intuition T1 and pme. Notice that in accordance with intuition T2, pme provides a result where q3 → 2/3 for ϑ approaching 0 or 1.

7.3 Epistemic Entrenchment

Consider two future events A and B. You have partial belief in whether they will occur and assign probabilities to them. Then you learn that A entails B. How does this information affect your probability assignment for event A? If A is causally independent of B then your updated probability for it should equal the original probability.

Example 39: Sundowners at the Westcliff. Whether Sarah and Marian have sundowners at the Westcliff hotel tomorrow may initially not depend on the weather at all, but if they learn that there will be a wedding and the hotel’s indoor facilities will be closed to the public, then rainfall tomorrow implies no sundowners.

Learning this conditional does not affect the probability of the antecedent (rainfall tomorrow), because the antecedent is causally independent of the consequent.

Figure 7.3: Judy Benjamin’s updated probability assignment according to intuition T1.
0 < ϑ < 1 forms the horizontal axis, the vertical axis shows the updated probability distribution (or the normalized odds vector) (q1, q2, q3). The vertical line at ϑ = 0.75 shows the specific updated probability distribution G_ind(0.75) for the Judy Benjamin problem.

Example 40: Heist. A jeweler has been robbed, and Kate has reason to assume that Henry might be the robber. Kate knows that he is not capable of actually injuring another person, but he may very well engage in robbery. When Kate hears from the investigator that the robber also shot the jeweler she concludes that Henry is not the robber. (A similar example is in Leendert Huisman’s paper “Learning from Simple Indicative Conditionals,” to be published in Erkenntnis, involving an exam and a skiing vacation.)

Kate has learned a conditional and adjusted the probability for the antecedent. The reason for this is that Kate was epistemically entrenched to uphold her belief in Henry’s nonviolent nature. Updating probabilities upon learning a conditional depends on epistemic entrenchments. (Examples 39 and 40 are from Douven and Romeijn, 2009.)

Figure 7.4: Judy Benjamin’s updated probability assignment using pme. 0 < ϑ < 1 forms the horizontal axis, the vertical axis shows the updated probability distribution (or the normalized odds vector) (q1, q2, q3). The vertical line at ϑ = 0.75 shows the specific updated probability distribution G_max(0.75) for the Judy Benjamin problem.

In Judy Benjamin’s case, (HDQ) is also a conditional. If Judy is in Red territory, she is more likely to be in the Headquarters area. According to pme, the updated probability for the antecedent of this conditional is raised. It appears that pme preempts epistemic entrenchments and nolens volens assigns a certain degree of confirmation to the antecedent of a learned conditional. This degree of confirmation depends on the causal dependency of the antecedent on the consequent. In this section I will focus on the independence assumptions that are improperly imported into the Judy Benjamin case by detractors of pme. I will be particularly critical of Douven and Romeijn, who hold that the Judy Benjamin case is a case for Adams conditioning, where the antecedent is left alone in the updating of probabilities.

Even though T1 is an understandably strong intuition, it does not take into account that the information given to Judy by her commanders may indeed be dependent on whether she is in Blue or in Red territory. To underline this objection to intuition T1 consider three scenarios, any of which may form the basis of the partial information provided by her commanders.

I Judy is dropped off by a pilot who flips two coins. If the first coin lands H, then Judy is dropped off in Blue territory, otherwise in Red territory. If the second coin lands H, she is dropped off in the Headquarters area, otherwise in the Second Company area. Judy’s commanders find out that the second coin is biased ϑ : 1 − ϑ toward H with ϑ = 0.75. The normalized odds vector is G_I(0.75) = (0.125, 0.375, 0.500) and agrees with T1, because the choice of Blue or Red is completely independent from the choice of the Red Headquarters area or the Red Second Company area.

II The pilot randomly lands in any of the four quadrants and rolls a die. If she rolls an even number, she drops off Judy. If not, she takes her to another (or the same, the choice happens with replacement) randomly selected quadrant to repeat the procedure. Judy’s commanders find out, however, that for A1, the pilot requires a six to drop off Judy, not just an even number.
The normalized odds vector in this scenario is G_II(0.75) = (0.1, 0.3, 0.6) and does not accord with T1.

III Judy’s commanders have divided the map into 24 congruent rectangles, A3 into twelve, and A1 and A2 into six rectangles each (see figures 7.5 and 7.6). They have information that the only subsets of the 24 rectangles in which Judy Benjamin may be located are such that they contain three times as many A2 rectangles as A1 rectangles. The normalized odds vector in this scenario is G_III(0.75) ≈ (0.108, 0.324, 0.568) (evaluating almost 17 million subsets).

I–III demonstrate the contrast between scenarios when independence is true and when it is not. Douven and Romeijn’s capital mistake in their paper is that they assume that the Judy Benjamin problem is analogous to example 39. Sarah, however, knows that whether it rains or not is independent of her activity the next night, whereas in Judy Benjamin we have no evidence of such independence, as scenario II makes clear. This is not to say that scenario II is the scenario that pertains in Judy Benjamin’s case. It only says that there is no natural assumption in Judy Benjamin’s case that the probabilities are independent of each other in light of the new evidence, for scenario II is perfectly natural (whether it is true or not is a completely different question) and reveals how dependence is consistent with the information that Judy Benjamin receives.

Douven and Romeijn’s strong independence claim relying on intuition T1 leads them to apply Jeffrey’s rule to the Judy Benjamin problem with the additional constraint q3 = p3. They claim that in most cases “the learning of a conditional is or would be irrelevant to one’s degree of belief for the conditional’s antecedent . . . the learning of the relevant conditional should intuitively leave the probability of the antecedent unaltered” (Douven and Romeijn, 2009, 9).

This, according to Douven and Romeijn, is the usual epistemic entrenchment and applies in full force to the Judy Benjamin problem. They give an example where the epistemic entrenchment could go the other way and leave the consequent rather than the antecedent unaltered (Kate and Henry, see Douven and Romeijn, 2009, 13). The idea of epistemic entrenchment is at odds with pme and seems to imply just what the full employment theorem claims: judgments so framed “will depend on the judgmental skills of the agent, typically acquired not in the inductive logic class but by subject specific training” (Bradley, 2005, 349). To pursue the relation between the semantics of conditionals, pme, and the full employment theorem would take us too far afield at present and shall be undertaken elsewhere. For the Judy Benjamin problem, it is not clear why Douven and Romeijn think that the way the problem is posed implies a strong epistemic entrenchment for Adams conditioning (Adams conditioning is the kind of conditioning that will leave the antecedent alone). Scenarios II–III provide realistic alternatives where Adams conditioning is inappropriate.

Judy Benjamin may also receive (HDQ) because her informers have found out that Red Headquarters troops have occupied the entire Blue territory (q1 = p1, q2 = 3p2, q3 = 0, the epistemic entrenchment is with respect to q1); because they have found out that Blue troops have occupied two-thirds of the Red Second Company area (q1 = (1/3)p1, q2 = p2, q3 = (4/3)p3, the epistemic entrenchment is with respect to q2); or because they have found out that Red Headquarters troops have taken over half of the Red Second Company area (q1 = (1/2)p1, q2 = (3/2)p2, q3 = p3, the epistemic entrenchment is with respect to q3 and what Douven and Romeijn take to be an assumption in the wording of the problem).
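The three occupation stories can be made concrete in a short sketch. It assumes, purely for illustration, that probability mass is spread uniformly over each area, so that occupying a fraction of an area moves that fraction of its mass; the function names are hypothetical. The areas are indexed as in (MAP) and (HDQ): A1 is the Second Company area, A2 the Headquarters area, A3 Blue territory.

```python
from fractions import Fraction as F

PRIOR = (F(1, 4), F(1, 4), F(1, 2))  # (p1, p2, p3) from (MAP)


def hq_occupies_blue(p):
    """Red Headquarters troops take all of Blue territory: A3's mass moves to A2."""
    p1, p2, p3 = p
    return (p1, p2 + p3, F(0))


def blue_occupies_two_thirds_of_second_company(p):
    """Blue troops take two-thirds of A1: that share of A1's mass moves to A3."""
    p1, p2, p3 = p
    moved = F(2, 3) * p1
    return (p1 - moved, p2, p3 + moved)


def hq_occupies_half_of_second_company(p):
    """Red Headquarters troops take half of A1: that share of A1's mass moves to A2."""
    p1, p2, p3 = p
    moved = p1 / 2
    return (p1 - moved, p2 + moved, p3)


def hdq_ratio(q):
    """Q(A2 | A1 or A2), which (HDQ) fixes at 3/4."""
    return q[1] / (q[0] + q[1])


if __name__ == "__main__":
    for update in (hq_occupies_blue,
                   blue_occupies_two_thirds_of_second_company,
                   hq_occupies_half_of_second_company):
        q = update(PRIOR)
        print(update.__name__, q, hdq_ratio(q))
```

Under these assumptions all three stories yield a vector satisfying (HDQ), yet each leaves a different component of the prior untouched.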
There is nothing in the problem that supports Douven and Romeijn’s narrowing of the options. The table reiterates these options, with the last line representing intuition (T1) and the epistemic entrenchment defended by Douven and Romeijn.

Epistemic entrenchment    q1      q2      q3
with respect to A1        1/4     3/4     0
with respect to A2        1/12    1/4     2/3
with respect to A3        1/8     3/8     1/2

7.4 Coarsening at Random

Another argument, forceful at first blush, that pme’s solution for the Judy Benjamin problem is counterintuitive has to do with coarsening at random, or CAR for short. CAR involves using more naive (or coarse) event spaces in order to arrive at solutions to probability updating problems. The CAR condition requires that conditioning on these coarse events does not give a result that is inconsistent with conditioning on a more finely-grained event that has been observed. The mechanics are spelled out in Grünwald and Halpern (2003). Grünwald and Halpern see a parallel between the Judy Benjamin problem and Martin Gardner’s Three Prisoners problem, see example 37 on page 158.

According to Grünwald and Halpern, for problems of this kind (Judy Benjamin, Three Prisoners, Monty Hall) there are naive and sophisticated spaces to which we can apply probability updates. If A uses the naive space, for example, he comes to the following conclusion: of the three possibilities that (A,B), (A,C), or (B,C) are executed, the warden’s information excludes (A,C). (A,B) and (B,C) are left over, and because A has no information about which one of these is true his chance of not being executed is 0.5. His chance of survival has increased from one third to one half.

Grünwald and Halpern show, correctly, that the application of the naive space is illegitimate because the CAR condition does not hold.
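The contrast between the two spaces for the Three Prisoners problem can be sketched by brute enumeration. The sketch assumes that the warden names B or C with equal probability when A is the one pardoned (the text says A has no information privileging either answer); all function names are illustrative.

```python
from fractions import Fraction as F

PARDON_PRIOR = {"A": F(1, 3), "B": F(1, 3), "C": F(1, 3)}


def naive_posterior_a_pardoned():
    """Condition on the coarse event 'B will be executed' (B is not pardoned)."""
    compatible = {k: v for k, v in PARDON_PRIOR.items() if k != "B"}
    return compatible["A"] / sum(compatible.values())


def warden_model():
    """Joint distribution over (pardoned prisoner, name the warden gives to A)."""
    joint = {}
    for pardoned, prob in PARDON_PRIOR.items():
        # The warden names a prisoner other than A who will be executed.
        options = [name for name in ("B", "C") if name != pardoned]
        for name in options:
            # Assumption: a uniform choice when both B and C are available.
            joint[(pardoned, name)] = prob / len(options)
    return joint


def sophisticated_posterior_a_pardoned(said="B"):
    """Condition on the finer event 'the warden says <said> will be executed'."""
    joint = warden_model()
    evidence = sum(v for (_, name), v in joint.items() if name == said)
    return joint[("A", said)] / evidence


if __name__ == "__main__":
    print(naive_posterior_a_pardoned())          # 1/2
    print(sophisticated_posterior_a_pardoned())  # 1/3
```

The only difference between the two functions is the event conditioned on, which is where the coarsening enters.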
More generally, Grünwald and Halpern show that updating on the naive space rather than the sophisticated space is always legitimate for event type observations when the set of observations is pairwise disjoint and, when the events are arbitrary, only when the CAR condition holds. For Jeffrey type observations, there is a generalized CAR condition which applies likewise. For affine constraints on which we cannot use Jeffrey conditioning (or, a fortiori, standard conditioning) maxent “essentially never gives the right results” (Grünwald and Halpern, 2003, 243).

Grünwald and Halpern conclude that “working with the naive space, while an attractive approach, is likely to give highly misleading answers” (246), especially in the application of pme to naive spaces as in the Judy Benjamin case “where applying [pme] leads to paradoxical, highly counterintuitive results” (245). For the Three Prisoners problem, Jaynes’ constraint rule would supposedly proceed as follows: the vector of prior probabilities for (A,B), (A,C), and (B,C) is (1/3, 1/3, 1/3). The constraint is that the probability of (A,C) is zero, and a simple application of the constraint rule yields (1/2, 0, 1/2) for the vector of updated probabilities. The CAR condition for the naive space does not hold, therefore the result is misleading.

By analogy, using the constraint rule on the naive space for the Judy Benjamin problem yields (0.117, 0.350, 0.533), but as the CAR condition fails in even the simplest settings for affine constraints (“CAR is (roughly speaking) guaranteed not to hold except in ‘degenerate’ situations” (251), emphasis in the original), it certainly fails for the Judy Benjamin problem, for which constructing a sophisticated space is complicated (see Grove and Halpern, 1997, where the authors attempt such a construction by retrospective conditioning).

The analogy, however, is misguided.
The constraint rule has been formally shown to generalize Jeffrey conditioning, which in turn has been shown to generalize standard conditioning (the authors admit as much in Grünwald and Halpern, 2003, 262). I can solve both the Monty Hall problem and the Three Prisoners problem by standard conditioning, not using the naive space, but simply using the correct space for the probability update. For the Three Prisoners problem, for example, the warden will say either ‘B’ or ‘C’ in response to A’s inquiry. Because A has no information that would privilege either answer the probability that the warden says ‘B’ and the probability that the warden says ‘C’ equal each other and therefore equal 0.5. Here is the difference between using the naive space and using the correct space, but either way using standard conditional probabilities:

$$P(\text{`A is pardoned'} \mid \text{`B will be executed'}) = \frac{P(\text{`A is pardoned'})}{P(\text{`A is pardoned'}) + P(\text{`C is pardoned'})} = \frac{1}{2} \quad \text{(incorrect)} \tag{7.7}$$

$$P(\text{`A is pardoned'} \mid \text{`warden says B will be executed'}) = \frac{P(\text{`A is pardoned' and `warden says B will be executed'})}{P(\text{`warden says B will be executed'})} = \frac{1/6}{1/2} = \frac{1}{3} \quad \text{(correct)} \tag{7.8}$$

Why is the first equation incorrect and the second one correct? Information theory gives us the right answer: in the first equation, we are conditioning on a watered down version of the evidence (watered down in a way that distorts the probabilities because we are not ‘coarsening at random’). ‘Warden says B will be executed’ is sufficient but not necessary for ‘B will be executed.’ The former proposition is more informative than the latter proposition (its probability is lower). Conditioning on the latter proposition leaves out relevant information contained in the wording of the problem.

Because pme always agrees with standard conditioning, pme gives the correct result for the Three Prisoners problem. For the Judy Benjamin problem, there is no defensible sophisticated space and no watering down of the evidence in what Grünwald and Halpern call the ‘naive’ space. The analogy between the Three Prisoners problem and the Judy Benjamin problem as it is set up by Grünwald and Halpern fails because of this crucial difference. A successful criticism would be directed at the construction of the ‘naive’ space: this is what I just accomplished for the Three Prisoners problem. There is no parallel procedure for the Judy Benjamin problem. The ‘naive’ space is all we have, and pme is the appropriate tool to deal with this lack of information.

7.5 The Powerset Approach

In this section, I will focus on scenario III and consider what happens when the grain of the partition becomes finer. I call this the powerset approach. Two remarks are in order. On the one hand, the powerset approach has little independent appeal. The reason behind using pme is that we want evidence to have just the right influence on updated probabilities, i.e. neither over-inform nor under-inform. There is no corresponding reason why I should update my probabilities using the powerset approach. On the other hand, what the powerset approach does is lend support to another approach. In this task, it is persuasive because it tells us what would happen if I were to divide the event space into infinitesimally small, uniformly weighted, and independent ‘atomic’ bits of information.

In the process of arriving at the formal result, the powerset approach resembles an empirical experiment. I am making many assumptions favouring T1, but when the result comes in it supports T2 in astonishingly non-trivial ways. The powerset approach provides support for pme against T1 because it combines the assumptions grounding T1 with a limit construction and still yields a solution that closely approximates the one generated by pme rather than the one generated by T1.

On its own the powerset approach is just what Grünwald and Halpern call a naive space, for which CAR does not hold.
Hence the powerset approach will not give us a precise solution for the problem, although it may with some plausibility guide us in the right direction, especially if despite all its independence and uniformity assumptions it significantly disagrees with intuition T1.

Let us assume a partition of the Blue and Red territories into sets of equal measure (this is the division into rectangles of scenario III). (MAP) dictates that the number of sets covering A3 equals the number of sets covering A1 ∪ A2. Initially, any subset of this partition is a candidate for Judy Benjamin to consider. The constraint imposed by (HDQ) is that now I only consider subsets for which there are three times as many partition sets (or rectangles, although I am not necessarily limiting myself to rectangles) in A2 as there are in A1. In figures 7.5 and 7.6 there are diagrams of two subsets. One of them (figure 7.5) is not a candidate, the other one (figure 7.6) is.

Figure 7.5: This choice of rectangles is not a candidate because the number of rectangles in A2 is not a t-multiple of the number of rectangles in A1, here with s = 2, t = 3 as in scenario III.

Figure 7.6: This choice of rectangles is a candidate because the number of rectangles in A2 is a t-multiple of the number of rectangles in A1, here with s = 2, t = 3 as in scenario III.

Let X be the random variable that corresponds to the ratio of the number of partition elements (rectangles) that are in A3 to the total number of partition elements (rectangles) for a randomly chosen candidate. We would now anticipate that the expectation of X (which I will call EX) gives us an indication of the updated probability that Judy is in A3 (so EX ≈ q3). The powerset approach is often superior to the uniformity approach (Grove and Halpern use uniformity, with all the necessary qualifications): if you have played Monopoly, you will know that the frequencies for rolling a 2, a 7, or a 10 with two dice tend to conform more closely to the binomial distribution (based on a powerset approach) than to the uniform distribution with P(rolling i) = 1/11 for i = 2, . . ., 12.

Appendix E provides a formula for the powerset approach corresponding to the formula for the pme approach, giving us q3 dependent on t. Notice that this formula is for t = 2, 3, 4, . . .. For t = 1 use the Chu-Vandermonde identity to find that

$$EX_{12} = (t+1) \frac{\sum_{i=1}^{s} i \binom{ts}{i} \binom{ts}{ti}}{\sum_{i=0}^{s} \binom{ts}{i} \binom{ts}{ti}} = (t+1) \frac{s}{2} \tag{7.9}$$

and consequently EX = 1/2, as one would expect. For t = 1/2, 1/3, 1/4, . . . I can simply reverse the roles of A1 and A2. These results give us G_pws and a graph of the normalized odds vector (see figure 7.7), a bit bumpy around the middle because the t-values are discrete and farther apart in the middle, as t = ϑ/(1 − ϑ). Comparing the graphs of the normalized odds vector under Grove and Halpern’s uniformity approach (G_ind), Jaynes’ pme approach (G_max), and the powerset approach suggested in this chapter (G_pws), it is clear that the powerset approach supports pme.

Figure 7.7: Judy Benjamin’s updated probability assignment according to the powerset approach. 0 < ϑ < 1 forms the horizontal axis, the vertical axis shows the updated probability distribution (or the normalized odds vector) (q1, q2, q3). The vertical line at ϑ = 0.75 shows the specific updated probability distribution G_pws for the Judy Benjamin problem.

Going through the calculations, it seems at many places that the powerset approach should give its support to Grove and Halpern’s uniformity approach in keeping with intuition T1.
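For the 24-rectangle division of scenario III (s = 2, t = 3) the expectation EX can be computed by counting rather than by listing all subsets. The sketch below is one reading of the construction, not the formula of Appendix E: it assumes that candidates containing no Red rectangle at all (i = 0) are left out, an assumption under which the result comes out near the 0.568 reported for scenario III. The function names are illustrative.

```python
from fractions import Fraction as F
from math import comb


def powerset_blue_share(s=2, t=3):
    """Expected share of A3 rectangles among candidate subsets (EX in the text)."""
    n_a1 = n_a2 = t * s   # rectangles in A1 and in A2 (six each for s = 2, t = 3)
    n_a3 = n_a1 + n_a2    # (MAP): A3 holds as many rectangles as A1 and A2 together
    weight_total = 0
    share_total = F(0)
    # Choosing i rectangles in A1 forces t*i rectangles in A2.
    # Assumption: i = 0 (no Red rectangle at all) is left out.
    for i in range(1, s + 1):
        red_ways = comb(n_a1, i) * comb(n_a2, t * i)
        for k in range(n_a3 + 1):  # k rectangles chosen in A3
            ways = red_ways * comb(n_a3, k)
            weight_total += ways
            share_total += ways * F(k, k + (t + 1) * i)
    return share_total / weight_total


def powerset_odds_vector(s=2, t=3):
    """Normalized odds vector with q3 = EX and q2 = t * q1."""
    q3 = powerset_blue_share(s, t)
    q1 = (1 - q3) / (1 + t)
    return (q1, t * q1, q3)


if __name__ == "__main__":
    print([round(float(q), 3) for q in powerset_odds_vector()])
```

Grouping subsets by the counts (i, k) keeps the work to a few dozen terms while still weighting every one of the underlying subsets equally.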
It is unexpected to find out that in the mathematical analysis the parameters converge to a non-trivial factor and do not tend to negative or positive infinity. Most surprisingly, the powerset approach, prima facie unrelated to an approach using information, supports the idea that a set of events about which nothing is known (such as A3) gains in probability in the updated probability distribution compared to the set of events about which something is known (such as A1 and A2), even if it is only partial information. Unless independence is specified, as in example 39, the area of ignorance gains compared to the area of knowledge.

I now have several ways to characterize Judy’s updated probabilities and updated probabilities following upon partial information in general. Only one of them, the uniformity approach, violates van Fraassen, Hughes, and Harman’s five symmetry requirements in (1986) and intuition T2. The uniformity approach, however, is the only one that satisfies intuition T1, an intuition which most people have when they first hear the story.

Two arguments attenuate the position of the uniformity approach in comparison with the others. First, T1 rests on an independence assumption which is not reflected in the problem. Although there is no indication that what Judy’s commanders tell her is in any way dependent on her probability of being in Blue territory, it is not excluded either (see scenarios II and III earlier in this chapter). pme takes this uncertainty into consideration. Second, when we investigate the problem using the powerset approach it turns out that a division into equally probable, independent, and increasingly fine bits of information supports not intuition T1 but rather intuition T2. pme, for now, is vindicated.
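The behaviour of the two rival solutions with respect to T1 and T2 can be checked numerically. The sketch below implements G_ind and the closed form (7.6) for G_max, and cross-checks the latter against a crude grid search over the divergence (7.4) restricted to the (HDQ) constraint. The grid search, its step count, and the function names are illustrative choices, not the constraint rule of subsection 3.2.3.

```python
import math

PRIOR = (0.25, 0.25, 0.5)  # (p1, p2, p3) from (MAP)


def g_ind(theta):
    """Solution according with intuition T1: q3 stays at 0.5."""
    return (0.5 * (1 - theta), 0.5 * theta, 0.5)


def g_max(theta):
    """Closed form (7.6) with t = theta / (1 - theta) and C = 2 * t**(t / (1 + t))."""
    t = theta / (1 - theta)
    c = 2 * t ** (t / (1 + t))
    z = 1 + t + c
    return (1 / z, t / z, c / z)


def kl(q, p=PRIOR):
    """Kullback-Leibler Divergence (7.4) in bits, with 0 * log 0 taken as 0."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def grid_minimizer(theta, steps=20000):
    """q1 minimizing the divergence on the line q2 = t*q1, q3 = 1 - (1+t)*q1."""
    t = theta / (1 - theta)
    upper = 1 / (1 + t)  # q1 must stay below this for q3 to be positive
    candidates = (j * upper / steps for j in range(1, steps))
    return min(candidates, key=lambda q1: kl((q1, t * q1, 1 - (1 + t) * q1)))


if __name__ == "__main__":
    print([round(x, 3) for x in g_ind(0.75)])
    print([round(x, 4) for x in g_max(0.75)])
    print(round(grid_minimizer(0.75), 4))
    print(round(g_max(0.999999)[2], 4), round(g_max(0.000001)[2], 4))
```

At ϑ = 0.5 the constraint already holds for the prior, and `g_max(0.5)` returns the prior unchanged, which is a quick sanity check on the closed form.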
We need to look for full employment not by cleverly manipulating prior probabilities, but by making fresh observations, designing better experiments, and partitioning the theory space more finely.

Chapter 8
Conclusion

It is one goal of formal epistemology to provide an account of normativity for partial beliefs. Bayesian epistemology, for example, defends the view that a rational agent updates relatively prior probabilities to posterior probabilities using standard conditioning if the event on which the agent conditions has a non-zero prior probability. If this probability is zero we get the Borel-Kolmogorov paradox (for a thorough treatment see the soon-to-be-published article “Conditioning using Conditional Expectations: The Borel-Kolmogorov Paradox” by Zalán Gyenis, Miklós Rédei, and Gábor Hofer-Szabó). In this dissertation, I have exclusively addressed finite event spaces so that the Borel-Kolmogorov paradox is not an issue. It is an open question whether Gyenis, Rédei, and Hofer-Szabó’s treatment provides us with a compact Lebesgue space that is a natural extension of the geometry of reason. In personal communication, Rédei speculated that this may be the case.
Extending pme to cases where the cardinality of the event space is no longer finite would need to face these questions.

Bayesian epistemology is controversial even where the event space is finite and the probability of the event on which the posterior probability is conditioned is non-zero. It is also the most substantial normative account available for updating probabilities. Accounts that either underdetermine it (by claiming that it requires too much) or overdetermine it (by claiming that it requires too little) are generally in the shadow of the Bayesian account: they either take a few things or add a few things, but they seldom claim the kind of overarching explanatory scope and formal unity inherent in the Bayesian method.

Often this is interpreted to be a good thing: epistemology, after all, is a messy affair, and so it is not surprising that the account with the most formal unity is not the account that best accords with our intuitions, for that would just be a strange coincidence. There is now a compendium of rules and methods, some of which go beyond Bayesian epistemology while others question its general validity. At the same time, a few formal accounts have emerged which rival Bayesian epistemology in substance: AGM theory and ranking theory, which address full beliefs rather than partial beliefs but have the potential of informing partial belief epistemology as well; Dempster-Shafer theory and possibility theory, which seek to address situations where Bayesian epistemology seems to give the wrong answers.

This dissertation defends a formal account which also rivals Bayesian epistemology in scope and unity, but incorporates it by generalizing its requirements. Whereas Dempster-Shafer theory and possibility theory are alternatives to Bayesian epistemology, information theory agrees with the Bayesian method on all questions that come within the purview of both.
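One instance of this agreement can be illustrated in a few lines. The sketch assumes a loaded six-sided die as prior and the evidence ‘an even number was rolled’; the prior values, the random search, and the function names are illustrative. Among distributions Q that give the evidence probability 1, the one with the least divergence D(Q, P) is the ordinary conditional distribution, since D(Q, P) then equals D(Q, P(·|E)) − log2 P(E).

```python
import math
import random


def kl(q, p):
    """D(Q, P) = sum of q_i * log2(q_i / p_i), with 0 * log 0 taken as 0."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def condition(p, event):
    """Standard conditioning of p on a set of outcome indices."""
    mass = sum(p[i] for i in event)
    return [p[i] / mass if i in event else 0.0 for i in range(len(p))]


def random_distribution_on(event, n, rng):
    """A random distribution over n outcomes that gives the event probability 1."""
    weights = {i: rng.random() + 1e-9 for i in event}
    total = sum(weights.values())
    return [weights[i] / total if i in event else 0.0 for i in range(n)]


if __name__ == "__main__":
    prior = [0.1, 0.2, 0.1, 0.25, 0.15, 0.2]  # hypothetical loaded die, faces 1..6
    even = {1, 3, 5}                          # indices of faces 2, 4, 6
    conditional = condition(prior, even)
    rng = random.Random(0)
    rivals = [random_distribution_on(even, 6, rng) for _ in range(1000)]
    best_rival = min(kl(q, prior) for q in rivals)
    print(kl(conditional, prior) <= best_rival)  # True
```

The random rivals only probe the constraint set; the inequality itself follows from the decomposition stated above.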
What distinguishes information theory from Bayesian epistemology is that information theory is normative where Bayesian epistemology is silent and that information theory has a story of why the Bayesian method works so well. For it turns out that standard conditioning just is the conditioning that minimizes the cross-entropy between the relatively prior probability distribution and the posterior probability distribution. Information theory can now go further: Jeffrey conditioning also minimizes this cross-entropy, and so does, quite obviously, the constraint rule for cross-entropy in cases where standard conditioning and Jeffrey conditioning do not apply.

After the introduction in chapter 1 and an explanation of information theory and entropy in chapter 2, I introduce the partial belief model and its formal properties using information theory in chapter 3. After chapter 3, there is a series of conflicts between information theory and current epistemological accounts. The first such conflict is between information theory and the geometry of reason in chapter 4. Many formal epistemologists now seek ways in which normativity about partial beliefs can be grounded in epistemic virtue. In science, we naturally follow the maxim to be as close as possible to the truth. For partial beliefs, however, the grounding virtues were often thought to be pragmatic or psychological in nature. James Joyce’s norm of gradational accuracy has provided a formal tool to make partial beliefs about epistemic virtue.

Unfortunately, the project has also relied heavily on a Euclidean metric for the difference between partial beliefs. Leitgeb and Pettigrew have correctly pointed out how this must lead to a rejection of Jeffrey conditioning as a natural extension of standard conditioning. Chapter 4 makes the case that given the choice between Jeffrey conditioning and the Euclidean metric, it is the Euclidean metric which should have to go.
The story ends on a bitter note and on a sweet note, but most of all with a formidable research project. The bitter note is that not only does the geometry of reason violate several strong epistemic intuitions, but so does information theory. The sweet note is that there is hope that Joyce’s norm of gradational accuracy will do just as well if it is not built on a Euclidean metric. I am hopeful that both of these open questions can be resolved, and I suspect that the theory of differential manifolds has the tools to resolve them for us.

The next conflict is in chapter 5 between a natural extension of Jeffrey conditioning, which I have called Wagner conditioning, and information theory. According to Wagner, this natural extension is inconsistent with information theory. I have found that his claims are correct, but only if we accept that rational agents sometimes have indeterminate credal states. Considering that the rationality of indeterminate credal states is now well established in the Bayesian community, this would be a substantial problem for information theory. In chapter 6, I make the case that, given the choice to reject information theory or indeterminate credal states, we should reject indeterminate credal states. The reasons are independent of information theory (one could also reject both information theory and indeterminate credal states). Indeterminate credal states, as attractive as they look at first glance, exact a heavy price for the solutions they offer to the problems of sharp credences.

Finally, in chapter 7, I address a counterexample to information theory that is often invoked by opponents as undermining the case for information theory as a general problem solver in partial belief kinematics.
The Judy Benjamin case is instructive in different ways, clarifying also the relationship between information theory and epistemic entrenchment, a concept originally tied to AGM belief revision but now increasingly also applied to partial beliefs; and the relationship between information theory and independence. Information theory appears to favour independence, often because independence is the middle ground between dependence relationships leaning one way or another. Having a bias with respect to dependence would provide information that is often not warranted. Information theory therefore settles on independence, without, however, asserting that a relationship is indeed independent. This ambiguity between lack of commitment to one or another way of dependence and independence is something that information theory must, and does, navigate carefully.

Now that the work is done, it remains to spell out where the open questions and the future research projects are. I already made reference to two of them: (1) It is imperative for information theory to give an account of the violations in List B in subsection 4.2.3. (2) It is imperative for information theory to give a positive account of what follows from Joyce’s norm of gradational accuracy once we strip it of its Euclidean prejudice.

To these two projects I add the following: (3) The precision-modesty paradox briefly mentioned in section 1.2 has significant potential to address a vexing question: the more modest and nuanced we are in our beliefs, the more information we appear to have about the propositions in question. I am interested in what partial beliefs are vis-à-vis full beliefs. Are partial beliefs just a particular kind of full belief? Do full beliefs and partial beliefs coordinate to produce the doxastic state of an agent? Or should we separate partial beliefs from belief psychology and interpret them as representative of an agent’s preferences?
An answer to this question may inform a solution to the precision-modesty paradox, and vice versa.

(4) While I was working on indeterminate credal states, I found out that an agent using lower and upper previsions (provided by indeterminate credal states) may systematically do better accepting and rejecting bets than an agent who uses sharp credences. Walley conducted an experiment in which this appeared to be true, betting on soccer games played in the Soccer World Cup 1982 in Spain (see Walley, 1991, Appendix I). I replicated the experiment using two computer players with rudimentary artificial intelligence and made them specify betting parameters (previsions) for games played in the Soccer World Cup 2014 in Brazil. I used the Poisson distribution (which is an excellent predictor for the outcome of soccer matches) and the FIFA ranking to simulate millions of counterfactual World Cup results and their associated bets, using Walley’s evaluation method. The player using upper and lower previsions had a slight but systematic advantage over the player using sharp credences. I would love to see the preliminary data explained by analytic expressions of the respective gains by the two players and a successful defence of sharp credences given this state of affairs, which is at first glance embarrassing. I have an idea of how this could be done. It would also be interesting to see how an explanatory account of these dynamics corresponds to the literature in economics on prediction markets and stock market trading.

(5) There are two impossibility theorems which would be of great interest if they could be established. The first one concerns desiderata S1–7 in section 6.2 for an information measure that has as its domain indeterminate rather than sharp credal states. Currently there is no known measure which fulfills these desiderata, but there is also no proof that such a measure does not exist.
If it did exist, it could undermine my claim that, in terms of conservation of information, indeterminate credal states do not outperform sharp credences, even though it is often claimed that they do. The other impossibility theorem concerns difference measures of probability distributions. I have listed some desiderata in List A and List B in subsection 4.2.3. It is currently not clear if there is a difference measure that fulfills all of these desiderata, or if there is a proof that no such measure exists. I would also be interested in a measure which supports Jeffrey conditioning while not committing all the violations of information theory in List B. Leitgeb and Pettigrew suggest that non-Euclidean but symmetric difference measures may have interesting properties, but I have not seen this avenue pursued in the literature.

Finally, a word about what motivated this dissertation on a more philosophical level. I used to feel kinship with some of the idealists in the Brigadas Internacionales of the Spanish Civil War. It was not because I shared any of their lived reality, but because I, as they did, believed in a guiding hand in history and I, as they were, was eventually disappointed in this belief. Karl Popper gives a systematic account of the poverty of historicism in a book carrying that title, based on personal experience at a street battle in the Hörlstraße in Vienna, where he overheard fellow Marxists endangering the lives of their friends in order to hurry along a revolution which they felt assured would come one way or another.
In his Third Essay on the Genealogy of Morals, Nietzsche puts it nicely:

Regarding nature as though it were a proof of God’s goodness and providence; interpreting history in honour of divine reason, as a constant testimonial to an ethical world order and ethical ultimate purpose; explaining all one’s own experiences in the way pious folk have done for long enough, as though everything were providence, a sign, intended, and sent for the salvation of the soul: now all that is over.

I prefer the clockwork mechanism of information theory over the probing genius of epistemologists because the probing genius of epistemologists has at its foundation an optimism about the calibration of our epistemological sensitivities that I do not share. My work in formal epistemology has a larger philosophical point in mind which more recognizably matters than a dissertation on Bayesian updating and maximum entropy: epistemological pessimism, the view that the increasing irrelevance of metaphysical questions in the 20th century is matched by an increasing irrelevance of epistemological questions now. Not only do we not know what there is, we do not know what we know. Humans are living beings, not knowing beings. If knowledge defines them, it does so in the sense that humans are deceived beings. Surrounded by metaphysical and epistemological illusions, humans are by their constitution deluded. Their delusions are what makes them interesting and tragic, creative and humorous.

Skeptics require that we know nothing at all, whereas epistemological pessimists observe that we may know a little, but whatever it is we know we don’t know that we know it, and it is not much in comparison to what we could know or what is out there to know for creatures that are better at knowing than we are. Epistemological pessimism is the view that recognizes that we do not drive with a good view of our surroundings.
We are driving a car in the fog, at full speed, with no access to the brakes and little access to information about what is going on, while constantly having to make life-changing decisions.

So I end with Laplace’s last words, according to Joseph Fourier’s account pronounced with difficulty: “Ce que nous connaissons est peu de chose, ce que nous ignorons est immense” (“What we know is little; what we do not know is immense”; Joseph Fourier, Éloge historique de M. le Marquis de Laplace, delivered 15 June 1829).

Bibliography

Alchourrón, Carlos, Peter Gärdenfors, and David Makinson. “On the Logic of Theory Change: Partial Meet Contraction and Revision Functions.” The Journal of Symbolic Logic 50, 2: (1985) 510–530.
Amari, Shun-ichi. Differential-Geometrical Methods in Statistics. Berlin, Germany: Springer, 1985.
Amari, Shun-ichi, and Hiroshi Nagaoka. Methods of Information Geometry. Providence, RI: American Mathematical Society, 2000.
Anton, Howard, and Robert Busby. Contemporary Linear Algebra. New York, NY: Wiley, 2003.
Armendt, Brad. “Is There a Dutch Book Argument for Probability Kinematics?” Philosophy of Science 47, 4: (1980) 583–588.
Atkinson, David. “Confirmation and Justification.” Synthese 184, 1: (2012) 49–61.
Augustin, Thomas. “On the Suboptimality of the Generalized Bayes Rule and Robust Bayesian Procedures from the Decision Theoretic Point of View – a Cautionary Note on Updating Imprecise Priors.” In ISIPTA. 2003, volume 3, 31–45.
Bar-Hillel, Yehoshua, and Rudolf Carnap. “Semantic Information.” The British Journal for the Philosophy of Science 4, 14: (1953) 147–157.
Belot, Gordon. “Bayesian Orgulity.” Philosophy of Science 80, 4: (2013) 483–503.
Bertsekas, Dimitri. Constrained Optimization and Lagrange Multiplier Methods. Boston, MA: Academic, 1982.
Bolker, Ethan. “Functions Resembling Quotients of Measures.” Transactions of the American Mathematical Society 124, 2: (1966) 292–312.
Bolker, Ethan. “A Simultaneous Axiomatization of Utility and Subjective Probability.” Philosophy of Science 34, 4: (1967) 333–340.
Boltzmann, Ludwig. On the Nature of Gas Molecules. Taylor and Francis, 1877.
Boole, George. An Investigation of the Laws of Thought: On Which Are Founded the Mathematical Theories of Logic and Probabilities. Walton and Maberly, 1854.
Bovens, Luc. “Judy Benjamin is a Sleeping Beauty.” Analysis 70, 1: (2010) 23–26.
Bradley, Richard. “Radical Probabilism and Bayesian Conditioning.” Philosophy of Science 72, 2: (2005) 342–364.
Bradley, Seamus, and Katie Steele. “Uncertainty, Learning, and the Problem of Dilation.” Erkenntnis 79, 6: (2014) 1287–1303.
Bradley, Seamus, and Katie Steele. “Can Free Evidence Be Bad? Value of Information for the Imprecise Probabilist.” Philosophy of Science 83, 1: (2016) 1–28.
Buck, B., and V.A. Macaulay. Maximum Entropy in Action: A Collection of Expository Essays. New York: Clarendon, 1991.
Calin, Ovidiu, and Constantin Udriste. Geometric Modeling in Probability and Statistics. Heidelberg, Germany: Springer, 2014.
van Campenhout, Jan, and Thomas Cover. “Maximum Entropy and Conditional Probability.” IEEE Transactions on Information Theory 27, 4: (1981) 483–489.
Carnap, Rudolf. The Continuum of Inductive Methods. University of Chicago, 1952.
Carnap, Rudolf. Logical Foundations of Probability. University of Chicago, 1962.
Carnap, Rudolf, and Yehoshua Bar-Hillel. An Outline of a Theory of Semantic Information. Cambridge, MA: MIT, 1952.
Caticha, Ariel. “Entropic Inference and the Foundations of Physics.” Brazilian Chapter of the International Society for Bayesian Analysis.
Caticha, Ariel, and Adom Giffin. “Updating Probabilities.” In MaxEnt 2006, the 26th International Workshop on Bayesian Inference and Maximum Entropy Methods. 2006.
Chandler, Jake. “Subjective Probabilities Need Not Be Sharp.” Erkenntnis 79, 6: (2014) 1273–1286.
Chentsov, Nikolai. Statistical Decision Rules and Optimal Inference. Providence, RI: American Mathematical Society, 1982.
Chino, Naohito.
“A Graphical Technique for Representing the Asymmetric Relationships Between N Objects.” Behaviormetrika 5, 5: (1978) 23–44.
Chino, Naohito. “A Generalized Inner Product Model for the Analysis of Asymmetry.” Behaviormetrika 17, 27: (1990) 25–46.
Chino, Naohito, and Kenichi Shiraiwa. “Geometrical Structures of Some Non-Distance Models for Asymmetric MDS.” Behaviormetrika 20, 1: (1993) 35–47.
Christensen, David. “Measuring Confirmation.” The Journal of Philosophy 96, 9: (1999) 437–461.
Coombs, Clyde H. A Theory of Data. New York, NY: Wiley, 1964.
Cover, T.M., and J.A. Thomas. Elements of Information Theory, volume 6. Hoboken, NJ: Wiley, 2006.
Cox, Richard. “Probability, Frequency and Reasonable Expectation.” American Journal of Physics 14: (1946) 1.
Coxon, Anthony. The User’s Guide to Multidimensional Scaling. Exeter, NH: Heinemann Educational Books, 1982.
Crupi, Vincenzo, and Katya Tentori. “State of the Field: Measuring Information and Confirmation.” Studies in History and Philosophy of Science Part A 47: (2014) 81–90.
Crupi, Vincenzo, Katya Tentori, and Michel Gonzalez. “On Bayesian Measures of Evidential Support: Theoretical and Empirical Issues.” Philosophy of Science 74, 2: (2007) 229–252.
Csiszár, Imre. “Information-Type Measures of Difference of Probability Distributions and Indirect Observations.” Studia Scientiarum Mathematicarum Hungarica 2: (1967) 299–318.
Csiszár, Imre, and Paul C Shields. Information Theory and Statistics: A Tutorial. Hanover, MA: Now Publishers, 2004.
Cyranski, John F. “Measurement, Theory, and Information.” Information and Control 41, 3: (1979) 275–304.
De Finetti, B. “Sul Significato Soggettivo della Probabilità.” Fundamenta Mathematicae 17, 298-329: (1931) 197.
De Finetti, Bruno. “La prévision: ses lois logiques, ses sources subjectives.” In Annales de l’institut Henri Poincaré. Presses universitaires de France, 1937, volume 7, 1–68.
Debbah, Mérouane, and Ralf Müller.
“MIMO Channel Modeling and the Principle of Maximum Entropy.” IEEE Transactions on Information Theory 51, 5: (2005) 1667–1690.
Dempster, Arthur. “Upper and Lower Probabilities Induced by a Multivalued Mapping.” The Annals of Mathematical Statistics 38, 2: (1967) 325–339.
Diaconis, Persi, and Sandy Zabell. “Updating Subjective Probability.” Journal of the American Statistical Association 77, 380: (1982) 822–830.
Dias, P., and A. Shimony. “A Critique of Jaynes’ Maximum Entropy Principle.” Advances in Applied Mathematics 2: (1981) 172–211.
Domotor, Zoltan. “Probability Kinematics and Representation of Belief Change.” Philosophy of Science 47, 3: (1980) 384–403.
Döring, Frank. “Why Bayesian Psychology Is Incomplete.” Philosophy of Science 66: (1999) S379–S389.
Douven, I., and J.W. Romeijn. “A New Resolution of the Judy Benjamin Problem.” CPNSS Working Paper 5, 7: (2009) 1–22.
Douven, Igor, and Timothy Williamson. “Generalizing the Lottery Paradox.” British Journal for the Philosophy of Science 57, 4: (2006) 755–779.
Earman, John. Bayes or Bust? Cambridge, MA: MIT, 1992.
Edwards, Ward, Harold Lindman, and Leonard J. Savage. “Bayesian Statistical Inference for Psychological Research.” Psychological Review 70, 3: (1963) 193.
Elga, Adam. “Subjective Probabilities Should Be Sharp.” Philosophers’ Imprint 10, 5: (2010) 1–11.
Ellsberg, Daniel. “Risk, Ambiguity, and the Savage Axioms.” The Quarterly Journal of Economics 75, 4: (1961) 643–669.
Kampé de Fériet, J., and B. Forte. “Information et probabilité.” Comptes rendus de l’Académie des sciences A 265: (1967) 110–114.
Festa, R. “Bayesian Confirmation.” In Experience, Reality, and Scientific Explanation: Essays in Honor of Merrilee and Wesley Salmon, edited by Merrilee H. Salmon, Maria Carla Galavotti, and Alessandro Pagnini, Dordrecht: Kluwer, 1999, 55–88.
Fitelson, Branden. “A Bayesian Account of Independent Evidence with Applications.” Philosophy of Science 68, 3: (2001) 123–140.
Fitelson, Branden. “Logical Foundations of Evidential Support.” Philosophy of Science 73, 5: (2006) 500–512.
Foley, Richard. Working Without a Net: A Study of Egocentric Epistemology. New York, NY: Oxford University Press, 1993.
Forster, Malcolm, and Elliott Sober. “How to Tell When Simpler, More Unified, or Less Ad Hoc Theories Will Provide More Accurate Predictions.” The British Journal for the Philosophy of Science 45, 1: (1994) 1–35.
van Fraassen, Bas. “A Problem for Relative Information Minimizers in Probability Kinematics.” The British Journal for the Philosophy of Science 32, 4: (1981) 375–379.
van Fraassen, Bas. “Belief and the Will.” Journal of Philosophy 81, 5: (1984) 235–256.
van Fraassen, Bas. Laws and Symmetry. Oxford, UK: Clarendon, 1989.
van Fraassen, Bas. “Symmetries of Probability Kinematics.” In Bridging the Gap: Philosophy, Mathematics, and Physics, Springer, 1993, 285–314.
van Fraassen, Bas, R.I.G. Hughes, and Gilbert Harman. “A Problem for Relative Information Minimizers, Continued.” The British Journal for the Philosophy of Science 37, 4: (1986) 453–463.
Friedman, Kenneth, and Abner Shimony. “Jaynes’s Maximum Entropy Prescription and Probability Theory.” Journal of Statistical Physics 3, 4: (1971) 381–384.
Gage, Douglas W, and David Hestenes. “Comment on the Paper Jaynes’s Maximum Entropy Prescription and Probability Theory.” Journal of Statistical Physics 7, 1: (1973) 89–90.
Gaifman, Haim. “Subjective Probability, Natural Predicates and Hempel’s Ravens.” Erkenntnis 14, 2: (1979) 105–147.
Gemes, Ken. “Verisimilitude and Content.” Synthese 154, 2: (2007) 293–306.
Gentleman, R., B. Ding, S. Dudoit, and J. Ibrahim. “Distance Measures in DNA Microarray Data Analysis.” In Bioinformatics and Computational Biology Solutions Using R and Bioconductor, edited by R. Gentleman, V. Carey, W. Huber, R. Irizarry, and S. Dudoit, Springer, 2006.
Gibbs, Josiah Willard. Elementary Principles in Statistical Physics. New Haven, CT: Yale University, 1902.
Giffin, Adom. Maximum Entropy: The Universal Method for Inference.
PhD dissertation, University at Albany, State University of New York, Department of Physics, 2008.
Gillies, Donald. Philosophical Theories of Probability. London, UK: Routledge, 2000.
Goldstein, Michael. “The Prevision of a Prevision.” Journal of the American Statistical Association 78, 384: (1983) 817–819.
Good, Irving. Good Thinking: The Foundations of Probability and Its Applications. Minneapolis, MN: University of Minnesota, 1983.
Good, John Irving. Probability and the Weighing of Evidence. London, UK: Griffin, 1950.
Greaves, Hilary, and David Wallace. “Justifying Conditionalization: Conditionalization Maximizes Expected Epistemic Utility.” Mind 115, 459: (2006) 607–632.
Grove, A., and J.Y. Halpern. “Probability Update: Conditioning Vs. Cross-Entropy.” In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. 1997, 208–214.
Grünwald, Peter. “Maximum Entropy and the Glasses You Are Looking Through.” In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 2000, 238–246.
Grünwald, Peter, and Joseph Halpern. “Updating Probabilities.” Journal of Artificial Intelligence Research 19: (2003) 243–278.
Guiaşu, Silviu. Information Theory with Application. New York, NY: McGraw-Hill, 1977.
Gyenis, Zalán, and Miklós Rédei. “Defusing Bertrand’s Paradox.” The British Journal for the Philosophy of Science 66, 2: (2015) 349–373.
Hacking, Ian. “Slightly More Realistic Personal Probability.” Philosophy of Science 34, 4: (1967) 311–325.
Hájek, Alan. “What Conditional Probability Could Not Be.” Synthese 137, 3: (2003) 273–323.
Hájek, Alan, and James Joyce. “Confirmation.” In Routledge Companion to the Philosophy of Science, edited by S. Psillos, and M. Curd, New York, NY: Routledge, 2008, 115–129.
Hájek, Alan, and Michael Smithson. “Rationality and Indeterminate Probabilities.” Synthese 187, 1: (2012) 33–48.
Halpern, Joseph.
“A Counterexample to Theorems of Cox and Fine.” Journal of Artificial Intelligence Research 10: (1999) 67–85.
Halpern, Joseph. Reasoning About Uncertainty. Cambridge, MA: MIT, 2003.
Harshman, Richard, and Margaret Lundy. “The PARAFAC Model for Three-Way Factor Analysis and Multidimensional Scaling.” In Research Methods for Multimode Data Analysis, edited by Henry G. Law, New York, NY: Praeger, 1984, 122–215.
Harshman, Richard A., Paul E. Green, Yoram Wind, and Margaret E. Lundy. “A Model for the Analysis of Asymmetric Data in Marketing Research.” Marketing Science 1, 2: (1982) 205–242.
Hempel, Carl G., and Paul Oppenheim. “A Definition of Degree of Confirmation.” Philosophy of Science 12, 2: (1945) 98–115.
Hobson, A. Concepts in Statistical Mechanics. New York, NY: Gordon and Breach, 1971.
Hobson, Arthur. “The Interpretation of Inductive Probabilities.” Journal of Statistical Physics 6, 2: (1972) 189–193.
Howson, Colin, and Allan Franklin. “Bayesian Conditionalization and Probability Kinematics.” The British Journal for the Philosophy of Science 45, 2: (1994) 451–466.
Howson, Colin, and Peter Urbach. Scientific Reasoning: The Bayesian Approach, Third Edition. Chicago: Open Court, 2006.
Huisman, Leendert. “On Indeterminate Updating of Credences.” Philosophy of Science 81, 4: (2014) 537–557.
Huttegger, Simon. “Merging of Opinions and Probability Kinematics.” The Review of Symbolic Logic 8: (2015) 611–648.
Ingarden, R. S., and K. Urbanik. “Information Without Probability.” Colloquium Mathematicum 9: (1962) 131–150.
Jaynes, E.T. “Information Theory and Statistical Mechanics.” Physical Review 106, 4: (1957a) 620–630.
Jaynes, E.T. “Information Theory and Statistical Mechanics II.” Physical Review 108, 2: (1957b) 171.
Jaynes, E.T. “The Well-Posed Problem.” Foundations of Physics 3, 4: (1973) 477–492.
Jaynes, E.T. “Where Do We Stand on Maximum Entropy.” In The Maximum Entropy Formalism, edited by R.D. Levine, and M. Tribus, Cambridge, MA: MIT, 1978, 15–118.
Jaynes, E.T. “Some Random Observations.” Synthese 63, 1: (1985) 115–138.
Jaynes, E.T. “Optimal Information Processing and Bayes’s Theorem: Comment.” The American Statistician 42, 4: (1988) 280–281.
Jaynes, E.T. Papers on Probability, Statistics and Statistical Physics. Dordrecht: Springer, 1989.
Jaynes, E.T., and G.L. Bretthorst. Probability Theory: The Logic of Science. Cambridge, UK: Cambridge University, 2003.
Jeffrey, Richard. The Logic of Decision. New York, NY: McGraw-Hill, 1965.
Jeffrey, Richard. “Dracula Meets Wolfman: Acceptance Versus Partial Belief.” In Induction, Acceptance and Rational Belief, edited by Marshall Swain, Springer, 1970, 157–185.
Jeffrey, Richard. “Axiomatizing the Logic of Decision.” In Foundations and Applications of Decision Theory, Springer, 1978, 227–231.
Jeffrey, Richard. “Bayesianism with a Human Face.” Minnesota Studies in the Philosophy of Science 10: (1983) 133–156.
Jeffreys, Harold. Scientific Inference. Cambridge University, 1931.
Jeffreys, Harold. The Theory of Probability. Cambridge University, 1939.
Joyce, James. “A Nonpragmatic Vindication of Probabilism.” Philosophy of Science 65, 4: (1998) 575–603.
Joyce, James. The Foundations of Causal Decision Theory. Cambridge University, 1999.
Joyce, James. “How Probabilities Reflect Evidence.” Philosophical Perspectives 19, 1: (2005) 153–178.
Joyce, James. “The Development of Subjective Bayesianism.” In Handbook of the History of Logic 10, edited by D. M. Gabbay, S. Hartmann, and J. Woods, Amsterdam, Netherlands: Elsevier, 2009, 415–476.
Joyce, James. “A Defense of Imprecise Credences in Inference and Decision Making.” Philosophical Perspectives 24, 1: (2010) 281–323.
Kaplan, Mark. “In Defense of Modest Probabilism.” Synthese 176, 1: (2010) 41–55.
Karmeshu, J. Entropy Measures, Maximum Entropy Principle and Emerging Applications. New York: Springer-Verlag, 2003.
Kelly, T. “The Epistemic Significance of Disagreement.” In Oxford Studies in Epistemology 1, edited by Tamar Gendler, and John Hawthorne, New York, NY: Oxford University, 2008, 167–196.
Keynes, John Maynard. A Treatise on Probability. Diamond, 1909.
Keynes, John Maynard. A Treatise on Probability. London, UK: Macmillan, 1921.
Khinchin, A.I.
Mathematical Foundations of Information Theory. New York, NY: Dover, 1957.
Klir, George. Uncertainty and Information: Foundations of Generalized Information Theory. Hoboken, NJ: Wiley, 2006.
Kolmogorov, A.N. “Logical Basis for Information Theory and Probability Theory.” IEEE Transactions on Information Theory 14, 5: (1968) 662–664.
Kopperman, Ralph. “All Topologies Come from Generalized Metrics.” American Mathematical Monthly 95, 2: (1988) 89–97.
Kullback, Solomon. Information Theory and Statistics. Dover Publications, 1959.
Kullback, Solomon, and Richard Leibler. “On Information and Sufficiency.” The Annals of Mathematical Statistics 22, 1: (1951) 79–86.
Kyburg, Henry. “Recent Work in Inductive Logic.” In Recent Work in Philosophy, edited by Tibor Machan, and Kenneth Lucey, Totowa, NJ: Rowman and Allanheld, 1983, 87–150.
Kyburg, Henry. “Book Review: Betting on Theories by Patrick Maher.” Philosophy of Science 62, 2: (1995) 343.
Landes, Jürgen, and Jon Williamson. “Objective Bayesianism and the Maximum Entropy Principle.” Entropy 15, 9: (2013) 3528–3591.
Lange, Marc. “Is Jeffrey Conditionalization Defective by Virtue of Being Non-Commutative?” Synthese 123, 3: (2000) 393–403.
Leitgeb, Hannes, and Richard Pettigrew. “An Objective Justification of Bayesianism I: Measuring Inaccuracy.” Philosophy of Science 77, 2: (2010a) 201–235.
Leitgeb, Hannes, and Richard Pettigrew. “An Objective Justification of Bayesianism II: The Consequences of Minimizing Inaccuracy.” Philosophy of Science 77, 2: (2010b) 236–272.
Levi, Isaac. “Probability Kinematics.” The British Journal for the Philosophy of Science 18, 3: (1967) 197–209.
Levi, Isaac. “Direct Inference and Confirmational Conditionalization.” Philosophy of Science 48, 4: (1981) 532–552.
Levi, Isaac. “Imprecision and Indeterminacy in Probability Judgment.” Philosophy of Science 52, 3: (1985) 390–409.
Levi, Isaac. “The Demons of Decision.” The Monist 70, 2: (1987) 193–211.
Levinstein, Benjamin Anders. “Leitgeb and Pettigrew on Accuracy and Updating.” Philosophy of Science 79, 3: (2012) 413–424.
Lewis, David.
“A Subjectivist’s Guide to Objective Chance.” In Ifs: Conditionals, Belief, Decision, Chance and Time, edited by William Harper, Robert Stalnaker, and G.A. Pearce, Springer, 1981, 267–297.
Lewis, David. “Many, But Almost One.” In Ontology, Causality, and Mind: Essays on the Philosophy of D.M. Armstrong, edited by Keith Campbell, John Bacon, and Lloyd Reinhardt, Cambridge, UK: Cambridge University, 1993, 23–38.
Lewis, David. “Why Conditionalize.” In Papers in Metaphysics and Epistemology: Volume 2, Cambridge University, 2010, 403–407.
Lukits, Stefan. “The Principle of Maximum Entropy and a Problem in Probability Kinematics.” Synthese 191, 7: (2014b) 1409–1431.
Lukits, Stefan. “Maximum Entropy and Probability Kinematics Constrained by Conditionals.” Entropy Special Issue “Maximum Entropy Applied to Inductive Logic and Reasoning”: (2015) forthcoming.
MacKay, David. Information Theory, Inference and Learning Algorithms. Cambridge, UK: Cambridge, 2003.
Maher, Patrick. “Diachronic Rationality.” Philosophy of Science 120–141.
Maher, Patrick. Betting on Theories. Cambridge University, 1993.
Majerník, Vladimír. “Marginal Probability Distribution Determined by the Maximum Entropy Method.” Reports on Mathematical Physics 45, 2: (2000) 171–181.
Mikkelson, Jeffrey. “Dissolving the Wine/Water Paradox.” The British Journal for the Philosophy of Science 55, 1: (2004) 137–145.
Miller, David. “A Geometry of Logic.” In Aspects of Vagueness, edited by Heinz Skala, Settimo Termini, and Enric Trillas, Dordrecht, Holland: Reidel, 1984, 91–104.
Milne, Peter. “log[P(h/eb)/P(h/b)] Is the One True Measure of Confirmation.” Philosophy of Science 63, 1: (1996) 21–26.
Milne, Peter. “Information, Confirmation, and Conditionals.” Journal of Applied Logic 12, 3: (2014) 252–262.
von Mises, Richard. Mathematical Theory of Probability and Statistics. New York, NY: Academic, 1964.
Mork, Jonas Clausen. “Uncertainty, Credal Sets and Second Order Probability.” Synthese 190, 3: (2013) 353–378.
Mormann, Thomas.
“Geometry of Logic and Truth Approximation.” Poznan Studies in the Philosophy of the Sciences and the Humanities 83, 1: (2005) 431–454.
Moss, Sarah. “Epistemology Formalized.” Philosophical Review 122, 1: (2013) 1–43.
Neapolitan, Richard E, and Xia Jiang. “A Note of Caution on Maximizing Entropy.” Entropy 16, 7: (2014) 4004–4014.
Nozick, Robert. Philosophical Explanations. Cambridge, MA: Harvard University, 1981.
Oddie, Graham. “The Content, Consequence and Likeness Approaches to Verisimilitude: Compatibility, Trivialization, and Underdetermination.” Synthese 190, 9: (2013) 1647–1687.
van Ommen, Sandrien, Petra Hendriks, Dicky Gilbers, Vincent van Heuven, and Charlotte Gooskens. “Is Diachronic Lenition a Factor in the Asymmetry in Intelligibility Between Danish and Swedish?” Lingua 137: (2013) 193–213.
Paris, Jeff. The Uncertain Reasoner’s Companion: A Mathematical Perspective, volume 39. Cambridge, UK: Cambridge University, 2006.
Paris, Jeff, and Alena Vencovská. “A Note on the Inevitability of Maximum Entropy.” International Journal of Approximate Reasoning 4, 3: (1990) 183–223.
Pettigrew, Richard. “Epistemic Utility and Norms for Credences.” Philosophy Compass 8, 10: (2013) 897–908.
Popper, Karl. Conjectures and Refutations. London, UK: Routledge and Kegan Paul, 1963.
Ramsey, F.P. “Truth and Probability.” In Foundations of Mathematics and other Essays, edited by R. B. Braithwaite, London, UK: Kegan, Paul, Trench, Trubner, and Company, 1926, 156–198.
Rinard, Susanna. “A Decision Theory for Imprecise Probabilities.” Philosophers’ Imprint 15, 7: (2015) 1–16.
Rosenkrantz, R.D. “Bayesian Confirmation: Paradise Regained.” British Journal for the Philosophy of Science 45, 2: (1994) 467–476.
Rowbottom, Darrell. “The Insufficiency of the Dutch Book Argument.” Studia Logica 87, 1: (2007) 65–71.
Saburi, S., and N. Chino. “A Maximum Likelihood Method for an Asymmetric MDS Model.” Computational Statistics & Data Analysis 52, 10: (2008) 4673–4684.
Salmon, W.C.
The Foundations of Scientific Inference. Pittsburgh, PA: University of Pittsburgh, 1967.
Salmon, Wesley. “Confirmation and Relevance.” In Induction, Probability, and Confirmation, edited by Grover Maxwell, and Robert Milford Anderson, Minneapolis, MN: University of Minnesota Press, 1975, 3–36.
Savage, Leonard. The Foundations of Statistics. New York, NY: Wiley, 1954.
Schlesinger, George. “Measuring Degrees of Confirmation.” Analysis 55, 3: (1995) 208–212.
Seidenfeld, Teddy. “Why I Am Not an Objective Bayesian; Some Reflections Prompted by Rosenkrantz.” Theory and Decision 11, 4: (1979) 413–440.
Seidenfeld, Teddy. “Entropy and Uncertainty.” In Advances in the Statistical Sciences: Foundations of Statistical Inference, Springer, 1986, 259–287.
Seidenfeld, Teddy, Mark Schervish, and Joseph Kadane. “When Fair Betting Odds Are Not Degrees of Belief.” In Proceedings of the Biennial Meeting of the Philosophy of Science Association. JSTOR, 1990, 517–524.
Seidenfeld, Teddy, and Larry Wasserman. “Dilation for Sets of Probabilities.” The Annals of Statistics 1139–1154.
Sen, Amartya. “Choice Functions and Revealed Preference.” The Review of Economic Studies 38, 3: (1971) 307–317.
Shannon, C. E. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27, 1: (1948) 379–423, 623–656.
Shannon, Claude Elwood. “A Mathematical Theory of Communication.” ACM SIGMOBILE Mobile Computing and Communications Review 5, 1: (2001) 3–55.
Shimony, Abner. “The Status of the Principle of Maximum Entropy.” Synthese 63, 1: (1985) 35–53.
Shimony, Abner. The Search for a Naturalistic World View. Cambridge University, 1993.
Shogenji, Tomoji. “The Degree of Epistemic Justification and the Conjunction Fallacy.” Synthese 184, 1: (2012) 29–48.
Shore, John, and Rodney Johnson. “Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy.” IEEE Transactions on Information Theory 26, 1: (1980) 26–37.
Skilling, John.
“The Axioms of Maximum Entropy.” In Maximum-Entropy andBayesian Methods in Science and Engineering, edited by G.J. Erickson, and C.R.Smith, Dordrecht, Holland: Springer, 1988, 173–187.Skyrms, Brian. “Maximum Entropy Inference as a Special Case of Conditionaliza-tion.” Synthese 63, 1: (1985) 55–74.. “Dynamic Coherence.” In Advances in the Statistical Sciences: Foundationsof Statistical Inference, Springer, 1986, 233–243.. “Coherence.” In Scientific Inquiry in Philosophical Perspective, edited byN. Rescher, Lanham, MD: University Press of America, 1987a, 225–242.. “Updating, Supposing, and Maxent.” Theory and Decision 22, 3: (1987b)225–246.205BibliographySober, Elliott. “No Model, No Inference: A Bayesian Primer on the Grue Problem.”In Grue!: The New Riddle of Induction, edited by Douglas Frank Stalker, Chicago:Open Court, 1994, 225–240.Spohn, Wolfgang. The Laws of Belief: Ranking Theory and Its Philosophical Appli-cations. Oxford University, 2012.Stephens, Christopher. “When Is It Selectively Advantageous to Have True Beliefs?Sandwiching the Better Safe Than Sorry Argument.” Philosophical Studies 105,2: (2001) 161–189.Talbott, W. J. “Two Principles of Bayesian Epistemology.” Philosophical Studies62, 2: (1991) 135–150.Teller, Paul. “Conditionalization and Observation.” Synthese 26, 2: (1973) 218–258.. “Conditionalization, Observation, and Change of Preference.” In Founda-tions of Probability Theory, Statistical Inference, and Statistical Theories of Sci-ence, Dordrecht: Reidel, 1976.Tikochinsky, Y, NZ Tishby, and RD Levine. “Alternative Approach to Maximum-Entropy Inference.” Physical Review A 30, 5: (1984) 2638.Tobler, Waldo. Spatial Interaction Patterns. Schloss Laxenburg, Austria: Interna-tional Institute for Applied Systems Analysis, 1975.Topey, Brett. “Coin Flips, Credences and the Reflection Principle.” Analysis 72, 3:(2012) 478–488.Topsøe, F. “Information-Theoretical Optimization Techniques.” Kybernetika 15, 1:(1979) 8–27.Tribus, Myron, and Hector Motroni. 
“Comments on the Paper Jaynes’s MaximumEntropy Prescription and Probability Theory.” Journal of Statistical Physics 4,2: (1972) 227–228.206BibliographyTversky, Amos. “Features of Similarity.” Psychological Review 84, 4: (1977) 327–352.Uffink, Jos. “Can the Maximum Entropy Principle Be Explained as a ConsistencyRequirement?” Studies in History and Philosophy of Science 26, 3: (1995) 223–261.. “The Constraint Rule of the Maximum Entropy Principle.” Studies inHistory and Philosophy of Science 27, 1: (1996) 47–79.Wagner, Carl. “Probability Kinematics and Commutativity.” Philosophy of Science69, 2: (2002) 266–278.Wagner, Carl G. “Generalized Probability Kinematics.” Erkenntnis 36, 2: (1992)245–257.Walley, Peter. Statistical Reasoning with Imprecise Probabilities. Chapman and HallLondon, 1991.Weatherson, Brian. “Knowledge, Bets and Interests.” In Knowledge Ascriptions,edited by Jessica Brown, and Mikkel Gerken, Oxford, UK: Oxford University,2012, 75–103.. “For Bayesians, Rational Modesty Requires Imprecision.” Ergo 2, 20:(2015) 1–16.Weisberg, Jonathan. “You’ve Come a Long Way, Bayesians.” Journal of Philosoph-ical Logic 1–18, forthcoming.White, Roger. “Evidential Symmetry and Mushy Credence.” In Oxford Studies inEpistemology 3, edited by Tamar Gendler, and John Hawthorne, New York, NY:Oxford University, 2010, 161–186.Wiener, Norbert. “The Ergodic Theorem.” Duke Mathematical Journal 5, 1: (1939)1–18.207Williams, P. M. “Bayesian Conditionalisation and the Principle of Minimum Infor-mation.” British Journal for the Philosophy of Science 31, 2: (1980) 131–144.Williamson, Jon. “Objective Bayesianism, Bayesian Conditionalisation and Volun-tarism.” Synthese 178, 1: (2011) 67–85.Williamson, Timothy. Vagueness. New York, NY: Routledge, 1996.. Knowledge and Its Limits. Oxford, UK: Oxford University, 2000.Zabell, Sandy L. Symmetry and its Discontents: Essays on the History of InductiveProbability. Cambridge, UK: Cambridge University, 2005.Zalabardo, Jose´. 
“An Argument for the Likelihood-Ratio Measure of Confirmation.” Analysis 69, 4: (2009) 630–635.
Zellner, Arnold. “Optimal Information Processing and Bayes’s Theorem.” The American Statistician 42, 4: (1988) 278–280.
Zubarev, D.N., V. Morozov, and G. Roepke. Non-Equilibrium Statistical Thermodynamics. Whoknows, 1974.

Appendix A
Weak Convexity and Symmetry in Information Geometry

Using information theory instead of the geometry of reason, Joyce’s result still stands, vindicating probabilism on epistemic merits rather than prudential ones: partial beliefs which violate probabilism are dominated by partial beliefs which obey it, no matter what the facts are.

Joyce’s axioms, however, will need to be reformulated to accommodate asymmetry. This appendix shows that the axiom Weak Convexity (see section 4.2) still holds in information geometry. Consider three points $Q, R, S \in \mathbb{S}^{n-1}$ (replace $\mathbb{S}^{n-1}$ by the $n$-dimensional space of non-negative real numbers, if you do not want to assume probabilism) for which

\[ D_{\mathrm{KL}}(Q,R) = D_{\mathrm{KL}}(Q,S). \tag{A.1} \]

I will show something slightly stronger than Weak Convexity: Joyce’s inequality is not only true for the midpoint between $R$ and $S$ but for all points $\vartheta R + (1-\vartheta)S$, as long as $0 \leq \vartheta \leq 1$. The inequality aimed for is

\[ D_{\mathrm{KL}}(Q, \vartheta R + (1-\vartheta)S) \leq D_{\mathrm{KL}}(Q,R) = D_{\mathrm{KL}}(Q,S). \tag{A.2} \]

To show that it holds I need the log-sum inequality, which is a result of Jensen’s inequality (for a proof of the log-sum inequality see Theorem 2.7.1 in Cover and Thomas, 2006, 31). For non-negative numbers $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$,

\[ \sum_{i=1}^{n} a_i \ln\frac{a_i}{b_i} \geq \left(\sum_{i=1}^{n} a_i\right) \ln\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}. \tag{A.3} \]

(A.2) follows from (A.3) via

\[
\begin{aligned}
D_{\mathrm{KL}}(Q,R) &= \vartheta D_{\mathrm{KL}}(Q,R) + (1-\vartheta) D_{\mathrm{KL}}(Q,S) \\
&= \sum_{i=1}^{n} \left( \vartheta q_i \ln\frac{\vartheta q_i}{\vartheta r_i} + (1-\vartheta) q_i \ln\frac{(1-\vartheta) q_i}{(1-\vartheta) s_i} \right) \\
&\geq \sum_{i=1}^{n} q_i \ln\frac{q_i}{\vartheta r_i + (1-\vartheta) s_i} = D_{\mathrm{KL}}(Q, \vartheta R + (1-\vartheta)S).
\end{aligned}
\tag{A.4}
\]

I owe some thanks to physicist friend Thomas Buchberger for help with this proof.
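Inequality (A.2) can also be spot-checked numerically. The following Python sketch is only an illustration under assumptions that are not in the text: the three distributions are made up, $Q$ is chosen uniform so that any permutation of $R$ automatically satisfies (A.1), and the helper names `kl` and `mix` are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p, q) = sum_i p_i ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(r, s, theta):
    """Convex combination theta * r + (1 - theta) * s, taken coordinate-wise."""
    return [theta * ri + (1 - theta) * si for ri, si in zip(r, s)]

# Illustrative points (assumed, not from the thesis). Q is uniform, and S is a
# permutation of R, so D_KL(Q, R) = D_KL(Q, S) as required by (A.1).
Q = [1 / 3, 1 / 3, 1 / 3]
R = [0.5, 0.3, 0.2]
S = [0.2, 0.5, 0.3]

bound = kl(Q, R)
# Gap between the right-hand and left-hand side of (A.2) on a grid of theta values.
gaps = [bound - kl(Q, mix(R, S, k / 10)) for k in range(11)]
```

Every entry of `gaps` should be non-negative up to floating-point rounding, with the gap vanishing at $\vartheta = 0$ and $\vartheta = 1$.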
Interested readers can find a more general claim in Csiszár’s Lemma 4.1 (see Csiszár and Shields, 2004, 448) which accommodates convexity of the Kullback-Leibler divergence as a special case.

Appendix B
Asymmetry in Two Dimensions

This appendix contains a proof that the threefold partition (4.59) of $\mathbb{S}^1$ is well-behaved, in contrast to the threefold partition of $\mathbb{S}^2$ as illustrated by figure B.1. For the two-dimensional case, i.e. considering $p, q \in \mathbb{S}^1$ with $0 < p, q < 1$, $p + p' = 1$ and $q + q' = 1$,

\[
\begin{array}{ll}
\Delta_q(p) > 0 & \text{for } |p - p'| > |q - q'| \\
\Delta_q(p) = 0 & \text{for } |p - p'| = |q - q'| \\
\Delta_q(p) < 0 & \text{for } |p - p'| < |q - q'|
\end{array}
\tag{B.1}
\]

where $\Delta_q(p) = D_{\mathrm{KL}}(q,p) - D_{\mathrm{KL}}(p,q)$ and $D_{\mathrm{KL}}(p,q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q}$. Part of information theory’s ill behaviour outlined in section 4.5.3 is that in the higher-dimensional case the partition does not follow the simple rule that higher entropy of $P$ compared to $Q$ implies that $\Delta_Q(P) > 0$ ($\Delta$ here defined as in (4.58)). In the two-dimensional case, however, this simple rule applies.

That a comparison in entropy $H(p) = -p \log p - (1-p) \log(1-p)$ between $H(p)$ and $H(q)$ corresponds to a comparison of $|p - p'|$ and $|q - q'|$ is trivial. The proof for (B.1) is straightforward given the following non-trivial lemma establishing a very tight inequality. Given that $p + p' = 1$ and $q + q' = 1$ and $p, q, p', q' > 0$ it is true that

\[ \text{if } \log(p/q) > \log(q'/p') \text{ then } (p+q)\log(p/q) > (p'+q')\log(q'/p'). \tag{B.2} \]

Let $x = p/q$ and $y = q'/p'$. We know that $x > y$ since $\log x > \log y$. Now I want to show $(p+q)\log x > (p'+q')\log y$. Note that $p = xq$, $q' = p'y$, $p + q = q(x+1)$, and $p' + q' = p'(y+1)$. Therefore,

\[ q = \frac{1-y}{1-xy} \tag{B.3} \]

and

\[ p' = \frac{1-x}{1-xy}. \tag{B.4} \]

What I want to show is that $x > y$ implies

\[ \frac{1-y}{1-xy}(x+1)\log x > \frac{1-x}{1-xy}(y+1)\log y. \tag{B.5} \]

Note that $f(x) = (1-x)^{-1}(x+1)\log x$ is increasing on $(0,1)$ and decreasing on $(1,\infty)$, and consider the following two cases:

(i) When $x < 1$, $y < 1$, (B.5) follows from the fact that $f$ is increasing on $(0,1)$.

(ii) When $x > 1$, $y > 1$, (B.5) follows from the fact that $f$ is decreasing on $(1,\infty)$.

Mixed cases such as $x > 1$, $y < 1$ do not occur, as for example $x > 1$ implies $y > 1$.

Figure B.1: The partition (4.59) based on different values for $P$. From top left to bottom right, $P = (0.4, 0.4, 0.2)$; $P = (0.242, 0.604, 0.154)$; $P = (1/3, 1/3, 1/3)$; $P = (0.741, 0.087, 0.172)$. Note that for the geometry of reason, the diagrams are trivial. The challenge for information theory is to explain the non-triviality of these diagrams epistemically without begging the question.

Appendix C
The Horizon Requirement Formalized

Consider the following two conditions on a difference measure $D$ on a simplex $\mathbb{S}^{n-1} \subset \mathbb{R}^n$, which is assumed to be a smooth function from $\mathbb{S}^{n-1} \times \mathbb{S}^{n-1} \to \mathbb{R}$.

(h1) If $p, p', q, q'$ are collinear with the centre of the simplex $m$ (whose coordinates are $m_i = 1/n$ for all $i$) and an arbitrary but fixed boundary point $\xi \in \partial\mathbb{S}^{n-1}$ and $p, p', q, q'$ are all between $m$ and $\xi$ with $\|p' - p\| = \|q' - q\|$ where $p$ is strictly closest to $m$, then $|D(p,p')| < |D(q,q')|$. For an illustration of this condition see figure 4.3.

(h2) Let $\mu \in (-1, 1)$ be fixed and $D_\mu$ defined as in (C.2). Then $dD_\mu/dx > 0$, where $dD_\mu/dx$ is the total derivative as $x$ moves towards $\xi(x)$, the unique boundary point which is collinear with $x$ and $m$.

To define $D_\mu$, the hardest part is to specify the domain. Let this domain $V(\mu) \subseteq \mathbb{S}^{n-1}$ be defined as

\[
V(\mu) = \begin{cases}
\{x \in \mathbb{S}^{n-1} \mid x_i < (1-\mu)\xi_i(x) + \mu m_i,\ i = 1, \ldots, n\} & \text{for } \mu > 0 \\
\{x \in \mathbb{S}^{n-1} \mid x_i > (1+\mu) m_i - \mu\xi_i(x),\ i = 1, \ldots, n\} & \text{for } \mu < 0.
\end{cases}
\tag{C.1}
\]

Then $D_\mu : V(\mu) \to \mathbb{R}^+_0$ is defined as

\[ D_\mu(x) = |D(x, y(x))| \tag{C.2} \]

where $y_i(x) = x_i + \mu(\xi_i(x) - m_i)$. Remember that $\xi(x)$ is the unique boundary point which is collinear with $x$ and $m$.
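The construction of $\xi(x)$, $y(x)$ and $D_\mu$ can be made concrete with a small Python sketch. It is a rough illustration only: the function names are hypothetical, the domain restriction $V(\mu)$ is not enforced (the sample points are simply chosen so that $y(x)$ stays inside the simplex), and the two difference measures plugged in for $D$ are the Kullback-Leibler divergence and the Euclidean distance for $n = 2$.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p, q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def euclid(p, q):
    """Euclidean distance between two points of the simplex."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def boundary_point(x):
    """xi(x): the point where the ray from the centre m through x leaves the simplex.

    Assumes x differs from the centre m, otherwise the ray is undefined.
    """
    n = len(x)
    m = [1.0 / n] * n
    # Largest step s such that m + s * (x - m) still has non-negative coordinates.
    s = min(mi / (mi - xi) for mi, xi in zip(m, x) if xi < mi)
    return [mi + s * (xi - mi) for mi, xi in zip(m, x)]

def d_mu(D, x, mu):
    """D_mu(x) = |D(x, y(x))| with y_i(x) = x_i + mu * (xi_i(x) - m_i), as in (C.2)."""
    n = len(x)
    m = [1.0 / n] * n
    xi = boundary_point(x)
    y = [a + mu * (b - c) for a, b, c in zip(x, xi, m)]
    return abs(D(x, y))

# Points moving from the centre (0.5, 0.5) towards the boundary point (1, 0).
path = [[a, 1 - a] for a in (0.55, 0.6, 0.7, 0.8)]
kl_values = [d_mu(kl, x, 0.2) for x in path]
euclid_values = [d_mu(euclid, x, 0.2) for x in path]
```

Along this particular path `kl_values` increases towards the boundary while `euclid_values` stays constant, which is the contrast that (h2) is meant to capture for $n = 2$.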
Now for the proof that (h1) and (h2) are equivalent.

First assume (h1) and the negation of (h2). Since $D$ is smooth, there must be a $\bar{\mu}$ and two points $x'$ and $x''$ collinear with $m$ and a boundary point $\bar{\xi}$ such that $D_{\bar{\mu}}(x') \geq D_{\bar{\mu}}(x'')$ even though $\|\bar{\xi} - x''\| < \|\bar{\xi} - x'\|$. If this were not the case, $D_\mu$ would be strictly increasing running towards the boundary points for all $\mu$ and its total derivative would be strictly positive so that (h2) follows. Now consider the four points $x', x'', y', y''$ where $y'_i = x'_i + \bar{\mu}(\bar{\xi}_i - m_i)$ and $y''_i = x''_i + \bar{\mu}(\bar{\xi}_i - m_i)$ for $i = 1, \ldots, n$. Without loss of generality, assume $\bar{\mu} > 0$. Then $x', x'', y', y''$ fulfill the conditions in (h1) and $D_{\bar{\mu}}(x') < D_{\bar{\mu}}(x'')$, in contradiction to the aforesaid.

Then assume (h2). Let $x', x'', y', y''$ be four points as in (h1). Consider $\mu = \|x'' - x'\| / \|\xi - m\|$, so that $y(x') = x''$ and $y(y') = y''$. Then $D_\mu(x') = |D(x', x'')|$ and $D_\mu(y') = |D(y', y'')|$. (h2) tells us that along a path from $m$ to $\xi$, $D_\mu$ is strictly increasing, so $D_\mu(x') = |D(x', x'')| < |D(y', y'')| = D_\mu(y')$. QED.

Note that the Euclidean distance function violates both (h1) and (h2) in all dimensions. The Kullback-Leibler divergence fulfills them if $n = 2$ but violates them if $n > 2$. For $n = 2$, this is easily checked by considering the derivative of the Kullback-Leibler divergence for two dimensions (use $D_\varepsilon$ defined in (4.45) and (4.46) for the two-dimensional case instead of $D_\mu$ for the arbitrary-dimensional case). A counterexample to fulfillment of (h1) and (h2) for $n = 3$ is given in (4.53).

Now that I have shown that (h1) and (h2) are equivalent, it is easy to show that $J_P$, $L_P$ and $Z_P$ fulfill the horizon requirement while $M_P$, $R_P$ and $G_P$ violate it. The following table of derivatives will do (note that $\varepsilon = y - x$ is fixed while $x$ varies and that these derivatives have to be considered with the absolute value for the various degree of confirmation functions in mind).
Column 1 is the name of the candidate confirmation function; column 2 is the function with $x$ and $\varepsilon = y - x$ as arguments; column 3 is the derivative $\partial D(x,\varepsilon)/\partial x$ for $|D(x,\varepsilon)| = D(x,\varepsilon)$.

\[
\begin{array}{lll}
M_P(x,y) & \varepsilon & 0 \\[4pt]
R_P(x,y) & \log\dfrac{x+\varepsilon}{x} & -\dfrac{\varepsilon}{(x+\varepsilon)x} \\[8pt]
J_P(x,y) & \dfrac{x+\varepsilon}{1-x-\varepsilon} - \dfrac{x}{1-x} & \dfrac{1}{(1-x-\varepsilon)^2} - \dfrac{1}{(1-x)^2} \\[8pt]
L_P(x,y) & \log\dfrac{(x+\varepsilon)(1-x)}{x(1-x-\varepsilon)} & \text{(C.3)} \\[8pt]
G_P(x,y) & \log\dfrac{1-x}{1-x-\varepsilon} & \dfrac{\varepsilon}{(1-x)(1-x-\varepsilon)} \\[8pt]
Z_P(x,y) & \dfrac{\varepsilon}{1-x} \text{ or } \dfrac{\varepsilon}{x} & \dfrac{\varepsilon}{(1-x)^2} \text{ or } -\dfrac{\varepsilon}{x^2}
\end{array}
\]

with

\[ -\frac{\bigl((x+\varepsilon)(1-x) - (1-x-\varepsilon)x\bigr)(1-2x-\varepsilon)}{(x+\varepsilon)(1-x-\varepsilon)(1-x)x}. \tag{C.3} \]

Appendix D
The Hermitian Form Model

Asymmetric MDS is a promising approach to classify asymmetries in terms of their behaviour. This appendix demonstrates that Chino’s asymmetric MDS, both spatial and non-spatial, fails to give us explanations for information theory’s violation of transitivity of asymmetry. I am choosing Chino’s approach because it is the most general and most promising of all the different asymmetric MDS models (see, for example, Chino and Shiraiwa, 1993, where Chino manages to subsume many of the other approaches into his own).

Multi-dimensional scaling (MDS) visualizes similarity of individuals in datasets. Various techniques are used in information visualization, in particular to display the information contained in a proximity matrix. When the proximity matrix is asymmetrical, we speak of asymmetric MDS. These techniques can be spatial (see for example Chino, 1978), where the proximity relationships are visualized in two-dimensional or higher-dimensional space; or non-spatial (see for example Chino and Shiraiwa, 1993), where the proximity relationships are used to identify data sets with abstract spaces (in Chino’s case, finite-dimensional complex Hilbert spaces) and metrics defined on them.

The spatial approach in two dimensions fails right away for information theory because it cannot visualize transitivity violations.
The hope for other types of asymmetric MDS is that they would be able to distinguish between well-behaved and ill-behaved asymmetries and either exclude or identify better-behaved candidates than the Kullback-Leibler divergence for measuring the distance between probability distributions. I will use Chino’s most sophisticated non-spatial account to show that asymmetric MDS cannot solve this problem. For other asymmetric MDS note that with the Hermitian Form Model Chino seeks to integrate and generalize over all the other accounts.

Assume a finite proximity matrix. I will work with two examples here to avoid the detailed and abstract account provided by Chino. The first example is

\[
D = \begin{pmatrix} 0 & 2 & 3 \\ 3 & 0 & 1 \\ -1 & 2 & 0 \end{pmatrix}
\tag{D.1}
\]

and allows for easy calculations. The second example corresponds to (4.60), the example for transitivity violation where

\[
\hat{D} = \begin{pmatrix} 0.0000 & 0.0566 & 0.0487 \\ 0.0589 & 0.0000 & 0.0499 \\ 0.0437 & 0.0541 & 0.0000 \end{pmatrix},
\tag{D.2}
\]

and the elements of the matrix are $\hat{d}_{jk} = D_{\mathrm{KL}}(P_j, P_k)$. Note that the diagonal elements are all zero, as no updating is necessary to keep the probability distribution constant.

Chino first defines a symmetric matrix $S$ and a skew-symmetric matrix $T$ corresponding to the proximity matrix such that $D = S + T$.

\[ S = \frac{1}{2}(D + D') \quad\text{and}\quad T = \frac{1}{2}(D - D'). \tag{D.3} \]

Note that $D'$ is the transpose of $D$, $S$ is a symmetric matrix, and $T$ is a skew-symmetric matrix with $t_{jk} = -t_{kj}$. Next I define the Hermitian matrix

\[ H = S + iT, \tag{D.4} \]

where $i$ is the imaginary unit. $H$ is a Hermitian matrix with $h_{jk} = \overline{h_{kj}}$. Hermitian matrices are the complex generalization of real symmetric matrices. They have special properties (see section 8.9 in Anton and Busby, 2003) which guarantee the existence of a unitary matrix $U$ such that

\[ H = U \Lambda U^*, \tag{D.5} \]

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ with $n$ the dimension of $D$ and $\lambda_k$ the $k$-th eigenvalue of $H$ (theorem 8.9.8 in Anton and Busby, 2003). $U$ is the matrix of eigenvectors with the $k$-th column being the $k$-th eigenvector.
$U^*$ is the conjugate transpose of $U$. Given example (D.1), the numbers look as follows:

\[
H = \frac{1}{2} \begin{pmatrix} 0 & 5 - i & 2 + 4i \\ 5 + i & 0 & 3 - i \\ 2 - 4i & 3 + i & 0 \end{pmatrix}
\tag{D.6}
\]

and

\[
U = \begin{pmatrix}
0.019 + 0.639i & -0.375 + 0.195i & 0.514 + 0.386i \\
0.279 - 0.494i & -0.169 - 0.573i & 0.503 + 0.260i \\
-0.519 + 0.000i & 0.681 - 0.000i & 0.516 + 0.000i
\end{pmatrix}
\tag{D.7}
\]

with $\Lambda = \mathrm{diag}(-3.78, 0.0715, 3.71)$. $\Lambda$ is calculated using the characteristic polynomial $\lambda^3 - 14\lambda + 1$ of $H$. Notice that the characteristic polynomial is a depressed cubic (the second coefficient is zero), which facilitates computation and will in the end spell the failure of Chino’s program for my purposes.

Given example (D.2), the numbers are

\[
\hat{H} = \begin{pmatrix}
0.0000 + 0.0000i & 0.0578 - 0.0011i & 0.046 + 0.003i \\
0.0578 + 0.0011i & 0.0000 + 0.0000i & 0.052 - 0.002i \\
0.0462 - 0.0025i & 0.0520 + 0.0021i & 0.000 + 0.000i
\end{pmatrix}
\tag{D.8}
\]

and

\[
\hat{U} = \begin{pmatrix}
0.351 - 0.467i & -0.543 + 0.170i & -0.578 - 0.006i \\
-0.604 + 0.457i & -0.201 - 0.169i & -0.598 + 0.002i \\
0.290 - 0.000i & 0.779 + 0.000i & -0.555 + 0.000i
\end{pmatrix}
\tag{D.9}
\]

with $\Lambda = \mathrm{diag}(-0.060, -0.045, 0.104)$.

Chino now elegantly shows how the decomposition of $H = U \Lambda U^*$ defines a seminorm on a vector space. Let $\varphi(\zeta, \tau) = \zeta \Lambda \tau^*$. Then (i) $\varphi(\zeta_1 + \zeta_2, \tau) = \varphi(\zeta_1, \tau) + \varphi(\zeta_2, \tau)$, (ii) $\varphi(a\zeta, \tau) = a\varphi(\zeta, \tau)$, and (iii) $\varphi(\zeta, \tau) = \overline{\varphi(\tau, \zeta)}$. These three conditions characterize an inner product on a finite-dimensional complex Hilbert space, but only if a fourth condition is met: positive (or negative) definiteness ($\varphi(\zeta, \zeta) \geq 0$ for all $\zeta$). One might hope that positive definiteness identifies the more well-behaved asymmetries by associating with them a finite-dimensional complex Hilbert space with the norm $\|\zeta\| = \sqrt{\varphi(\zeta, \zeta)}$ defined on it (Chino himself speculatively mentioned this hope to me in personal communication).

The hope does not come to fruition. Without a non-trivial self-similarity relation, all seminorms defined as above are indefinite, and thus all cats are grey in the night.
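As a rough numerical cross-check of the eigenvalues reported for (D.1) and (D.2), the following Python sketch rebuilds $H = S + iT$ from a proximity matrix and solves the depressed cubic in closed form. The function names are illustrative and not Chino's. The closed-form roots use the standard trigonometric solution for a cubic with three real roots, which applies here only because both example matrices are 3 × 3 with a zero diagonal.

```python
import math

def hermitian_form(D):
    """H = S + iT with S = (D + D')/2 and T = (D - D')/2, for a square nested list D."""
    n = len(D)
    return [[(D[j][k] + D[k][j]) / 2 + 1j * (D[j][k] - D[k][j]) / 2
             for k in range(n)] for j in range(n)]

def eigenvalues_zero_diagonal_3x3(H):
    """Eigenvalues (ascending) of a 3x3 Hermitian matrix whose diagonal is zero.

    For such a matrix the characteristic polynomial is the depressed cubic
    lambda^3 - c1 * lambda - c0 with
      c1 = |h12|^2 + |h13|^2 + |h23|^2   and   c0 = det(H) = 2 Re(h12 h23 h31).
    """
    c1 = abs(H[0][1]) ** 2 + abs(H[0][2]) ** 2 + abs(H[1][2]) ** 2
    c0 = 2 * (H[0][1] * H[1][2] * H[2][0]).real
    # Trigonometric solution of lambda^3 + p * lambda + q = 0 with p < 0.
    p, q = -c1, -c0
    radius = 2 * math.sqrt(-p / 3)
    arg = (3 * q / (2 * p)) * math.sqrt(-3 / p)
    arg = max(-1.0, min(1.0, arg))  # guard against rounding just outside [-1, 1]
    theta = math.acos(arg) / 3
    return sorted(radius * math.cos(theta - 2 * math.pi * k / 3) for k in range(3))

D1 = [[0, 2, 3], [3, 0, 1], [-1, 2, 0]]                      # example (D.1)
D2 = [[0.0000, 0.0566, 0.0487],
      [0.0589, 0.0000, 0.0499],
      [0.0437, 0.0541, 0.0000]]                              # example (D.2)

eig1 = eigenvalues_zero_diagonal_3x3(hermitian_form(D1))
eig2 = eigenvalues_zero_diagonal_3x3(hermitian_form(D2))
```

In both examples the eigenvalues sum to zero, with the smallest negative and the largest positive, so $\varphi(\zeta, \zeta)$ takes both signs.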
Not only are well-behaved and ill-behaved asymmetries indistinguishable by the light of this seminorm, even the seminorms for symmetry are indefinite. Not only does this not help my programme, it also puts a serious damper on Chino’s, who never mentions the self-similarity requirement (which, given that we are dealing with a proximity matrix, is substantial).

Based on a theorem in linear algebra (see theorem 4.4.12 in Anton and Busby, 2003),

\[ \sum_{j=1}^{n} \lambda_j = \mathrm{tr}(A) \tag{D.10} \]

whenever the $\lambda_j$ are the eigenvalues of $A$. The reader can easily verify this theorem by noticing that the roots of the characteristic polynomial add up to the negative of its second coefficient (which is the trace of the original matrix). It is well-known that the eigenvalues of a Hermitian matrix are real-valued (theorem 8.9.4 in Anton and Busby, 2003), which is an important component for Chino to define the seminorm $\|\zeta\|$ with the help of $\varphi$. Unfortunately, using (D.10), the eigenvalues are not only real, but also add up to the trace of $H$, which is zero unless there is a non-trivial self-similarity relation.

Tversky entertains such self-similarity relations in psychology (see Tversky, 1977, 328), and Chino is primarily interested in applications in psychology. When the eigenvalues add up to zero, however, there will be positive and negative eigenvalues (unless the whole proximity matrix is the null-matrix), which renders the seminorm as defined by Chino indefinite. The Kullback-Leibler divergence is trivial with respect to self-similarity: $D_{\mathrm{KL}}(P, P) = 0$ for all $P$.

Appendix E
The Powerset Approach Formalized

Let us assume a partition $\{B_i\}_{i=1,\ldots,4n}$ of $A_1 \cup A_2 \cup A_3$ into sets that are of equal measure $\mu$ and whose intersection with $A_i$ is either the empty set or the whole set itself (this is the division into rectangles of scenario III). (MAP) dictates that the number of sets covering $A_3$ equals the number of sets covering $A_1 \cup A_2$. For convenience, I assume the number of sets covering $A_1$ to be $n$.
Let $\mathcal{C}$, a subset of the powerset of $\{B_i\}_{i=1,\ldots,4n}$, be the collection of sets which agree with the constraint imposed by (HDQ), i.e.

\[ C \in \mathcal{C} \text{ iff } C = \{C_j\} \text{ and } t\,\mu\Bigl(\bigcup C_j \cap A_1\Bigr) = \mu\Bigl(\bigcup C_j \cap A_2\Bigr). \tag{E.1} \]

In figures 7.5 and 7.6 there are diagrams of two elements of the powerset of $\{B_i\}_{i=1,\ldots,4n}$. One of them (figure 7.5) is not a member of $\mathcal{C}$, the other one (figure 7.6) is.

The binomial distribution dictates the expectation $EX$ of $X$, using simple combinatorics. In this case I require, again for convenience, that $n$ be divisible by $t$ and the ‘grain’ of the partition $A$ be $s = n/t$. Remember that $t$ is the factor by which (HDQ) indicates that Judy’s chance of being in $A_2$ is greater than being in $A_1$. In Judy’s particular case, $t = 3$ and $\vartheta = 0.75$. We introduce a few variables which later on will help for abbreviation:

\[ n = ts \qquad 2m = n \qquad 2j = n - 1 \qquad T = t^2 + 1 \tag{E.2} \]

$EX$, of course, depends both on the grain of $A$ and the value of $t$. It makes sense to make it independent of the grain by letting the grain become increasingly finer and by determining $EX$ as $s \to \infty$. This cannot be done for the binomial distribution, as it is notoriously uncomputable for large numbers (even with a powerful computer things get dicey around $s = 10$). But, equally notorious, the normal distribution provides a good approximation of the binomial distribution and will help us arrive at a formula for $G_{\mathrm{pws}}$ (corresponding to $G_{\mathrm{ind}}$ and $G_{\mathrm{max}}$), determining the value $q_3$ dependent on $\vartheta$ as suggested by the powerset approach.

First, I express the random variable $X$ by the two independent random variables $X_{12}$ and $X_3$. $X_{12}$ is the number of partition elements in the randomly chosen $C$ which are either in $A_1$ or in $A_2$ (the random variable of the number of partition elements in $A_1$ and the random variable of the number of partition elements in $A_2$ are decidedly not independent, because they need to obey (HDQ)); $X_3$ is the number of partition elements in the randomly chosen $C$ which are in $A_3$.
A relatively simple calculation shows that $EX_3 = n$, which is just what we would expect (either the powerset approach or the uniformity approach would give us this result):

\[ EX_3 = 2^{-2n} \sum_{i=0}^{2n} i \binom{2n}{i} = n \qquad \left(\text{use } \binom{n}{k} = \frac{n}{k}\binom{n-1}{k-1}\right) \tag{E.3} \]

The expectation of $X$, $X$ being the random variable expressing the ratio of the number of sets covering $A_3$ and the number of sets covering $A_1 \cup A_2 \cup A_3$, is

\[ EX = \frac{EX_3}{EX_{12} + EX_3} = \frac{n}{EX_{12} + n} \tag{E.4} \]

If we were able to use uniformity and independence, $EX_{12} = n$ and $EX = 1/2$, just as Grove and Halpern suggest (although their uniformity approach is admittedly less crude than the one used here). Will the powerset approach concur with the uniformity approach, will it support the principle of maximum entropy, or will it make another suggestion on how to update the prior probabilities? To answer this question, we must find out what $EX_{12}$ is, for a given value $t$ and $s \to \infty$, using the binomial distribution and its approximation by the normal distribution.

Using combinatorics,

\[ EX_{12} = (t+1) \frac{\sum_{i=1}^{s} i \binom{ts}{i}\binom{ts}{ti}}{\sum_{i=0}^{s} \binom{ts}{i}\binom{ts}{ti}} \tag{E.5} \]

Let us call the numerator of this fraction NUM and the denominator DEN. According to the de Moivre-Laplace Theorem,

\[ \mathrm{DEN} = \sum_{i=0}^{s} \binom{ts}{i}\binom{ts}{ti} \approx 2^{2n} \sum_{i=0}^{s} \int_{i-\frac{1}{2}}^{i+\frac{1}{2}} \mathcal{N}\!\left(\frac{n}{2}, \frac{n}{4}\right)\!(i)\; \mathcal{N}\!\left(\frac{n}{2}, \frac{n}{4}\right)\!(ti)\, di \tag{E.6} \]

where

\[ \mathcal{N}(\mu, \sigma^2)(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{E.7} \]

Substitution yields

\[ \mathrm{DEN} \approx 2^{2n} \frac{1}{\pi m} \sum_{i=0}^{s} \int_{i-\frac{1}{2}}^{i+\frac{1}{2}} \exp\left(-\frac{(x-m)^2}{m} - \frac{t^2\left(x - \frac{m}{t}\right)^2}{m}\right) dx \tag{E.8} \]

Consider briefly the argument of the exponential function:

\[ -\frac{(x-m)^2}{m} - \frac{t^2\left(x - \frac{m}{t}\right)^2}{m} = -\frac{t^2}{m}\left(a''x^2 + b''x + c''\right) = -\frac{t^2}{m}\left(a''(x - h'')^2 + k''\right) \tag{E.9} \]

with (the double prime sign corresponds to the simple prime sign for the numerator later on)

\[ a'' = \frac{1}{t^2}T \qquad b'' = (-2m)\frac{1}{t^2}(t+1) \qquad c'' = 2m^2\frac{1}{t^2} \qquad h'' = -b''/2a'' \qquad k'' = a''h''^2 + b''h'' + c'' \tag{E.10} \]
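Consequently,

\[ \mathrm{DEN} \approx 2^{2n} \exp\left(-\frac{t^2}{m}k''\right) \sqrt{\frac{1}{\pi a'' m t^2}} \int_{-\infty}^{s+\frac{1}{2}} \mathcal{N}\!\left(h'', \frac{m}{2a''t^2}\right) dx \tag{E.11} \]

And, using the error function for the cumulative density function of the normal distribution,

\[ \mathrm{DEN} \approx 2^{2n-1} \sqrt{\frac{1}{\pi a'' m t^2}} \exp\left(-\frac{k''t^2}{m}\right)\left(1 + \mathrm{erf}(w'')\right) \tag{E.12} \]

with

\[ w'' = \frac{t\sqrt{a''}\left(s + \frac{1}{2} - h''\right)}{\sqrt{m}} \tag{E.13} \]

I proceed likewise with the numerator, although the additional factor introduces a small complication:

\[ \mathrm{NUM} = \sum_{i=1}^{s} i \binom{ts}{i}\binom{ts}{ti} = \sum_{i=1}^{s} s \binom{ts}{i}\binom{ts-1}{ti-1} \approx s\,2^{2n-1} \sum_{i=1}^{s} \mathcal{N}\!\left(m, \frac{m}{2}\right)\!(i)\; \mathcal{N}\!\left(j, \frac{j}{2}\right)\!(ti-1) \]

Again, I substitute and get

\[ \mathrm{NUM} \approx s\,2^{2n-1} \left(\pi\sqrt{mj}\right)^{-1} \sum_{i=0}^{s-1} \int_{i-\frac{1}{2}}^{i+\frac{1}{2}} \exp\left(a'(x-h')^2 + k'\right) dx \tag{E.14} \]

where the argument for the exponential function is

\[ -\frac{1}{mj}\left(j(x-m)^2 + mt^2\left(x - \frac{j+1}{t}\right)^2\right) \tag{E.15} \]

and therefore

\[ a' = j + mt^2 \qquad b' = 2j(1-m) + 2mt(t-j) \qquad c' = j(1-m)^2 + m(t-j-1)^2 \tag{E.16} \]

\[ h' = -b'/2a' \qquad k' = a'h'^2 + b'h' + c' \tag{E.17} \]

Using the error function,

\[ \mathrm{NUM} \approx 2^{2n-2} \frac{s}{\sqrt{\pi a'}} \exp\left(-\frac{k'}{mj}\right)\left(1 + \mathrm{erf}(w')\right) \tag{E.18} \]

with

\[ w' = \frac{\sqrt{a'}\left(s - \frac{1}{2} - h'\right)}{\sqrt{mj}} \tag{E.19} \]

Combining (E.12) and (E.18),

\[ EX_{12} = (t+1)\frac{\mathrm{NUM}}{\mathrm{DEN}} \approx \frac{1}{2}(t+1)\sqrt{\frac{Tts}{Tts-1}}\; s\, e^{\alpha_{t,s}} \]

for large $s$, because the arguments for the error function $w'$ and $w''$ escape to positive infinity in both cases (NUM and DEN) so that their ratio goes to 1. The argument for the exponential function is

\[ \alpha_{t,s} = -\frac{k'}{mj} + \frac{k''t^2}{m} \tag{E.20} \]

and, for $s \to \infty$, goes to

\[ \alpha_t = \frac{1}{2}T^{-2}\left(2t^3 - 3t^2 + 4t - 5\right) \tag{E.21} \]

Notice that, for $t \to \infty$, $\alpha_t$ goes to 0 and

\[ EX = \frac{n}{EX_{12} + n} \to \frac{2}{3} \tag{E.22} \]

in accordance with intuition T2.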
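For small and moderate grains, the combinatorial expressions (E.3), (E.4) and (E.5) can be evaluated directly. The following Python sketch is an illustration rather than part of the derivation: it assumes exact rational arithmetic with arbitrary-precision integers from the standard library, the function names are hypothetical, and it makes no claim about how closely the finite-$s$ values track the normal approximation developed above.

```python
from fractions import Fraction
from math import comb

def expected_x3(n):
    """EX_3 as in (E.3): mean size of a uniformly chosen subset of the 2n cells of A_3."""
    return Fraction(sum(i * comb(2 * n, i) for i in range(2 * n + 1)), 2 ** (2 * n))

def expected_x12(t, s):
    """EX_12 as in (E.5), evaluated exactly for factor t and grain s (with n = t * s)."""
    n = t * s
    num = sum(i * comb(n, i) * comb(n, t * i) for i in range(1, s + 1))
    den = sum(comb(n, i) * comb(n, t * i) for i in range(0, s + 1))
    return (t + 1) * Fraction(num, den)

def expected_x(t, s):
    """EX as in (E.4), using EX_3 = n."""
    n = t * s
    return Fraction(n) / (expected_x12(t, s) + n)

# Judy's case, t = 3, for a few grains (values are exact fractions; convert to
# float only for display).
judy = {s: float(expected_x(3, s)) for s in (1, 5, 20, 80)}
```

For $t = 1$ the constraint (HDQ) is symmetric between $A_1$ and $A_2$, and `expected_x(1, s)` returns exactly 1/2 for every grain, which agrees with the uniformity value mentioned after (E.4).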
Title | Information theory and partial belief reasoning |
Creator |
Lukits, Stefan Hermann |
Publisher | University of British Columbia |
Date Issued | 2016 |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2016-05-27 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0303149 |
URI | http://hdl.handle.net/2429/58193 |
Degree |
Doctor of Philosophy - PhD |
Program |
Philosophy |
Affiliation |
Arts, Faculty of Philosophy, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2016-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |