UBC Theses and Dissertations


A theory of physical probability, by Richard Johns (1998)


Full Text


A THEORY OF PHYSICAL PROBABILITY

by

RICHARD JOHNS

B.Sc., The University of Nottingham, 1989
M.A., The University of Nottingham, 1991

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Department of Philosophy)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
November 1998
© Richard Johns, 1998

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

ABSTRACT

It is now common to hold that causes do not always (and perhaps never) determine their effects, and indeed theories of "probabilistic causation" abound. The basic idea of these theories is that C causes E just in case C and E both occur, and the chance of E would have been lower than it is had C not occurred. The problems with these accounts are that (i) the notion of chance remains primitive, and (ii) this account of causation does not coincide with the intuitive notion of causation as ontological support. Turning things around, I offer an analysis of chance in terms of causation, called the causal theory of chance. The chance of an event E is the degree to which it is determined by its causes. Thus chance events have full causal pedigrees, just like determined events; they are not "events from nowhere". I hold that, for stochastic as well as for deterministic processes, the actual history of a system is caused by its dynamical properties (represented by the lagrangian) and the boundary condition.
A system is stochastic if (a description of) the actual history is not fully determined by maximal knowledge of these causes, i.e. it is not logically entailed by them. If chance involves partial determination, and determination is logical entailment, then there must be such a thing as partial entailment, or logical probability. To make the notion of logical probability plausible, in the face of current opposition to it, I offer a new account of logical probability which meets objections levelled at the previous accounts of Keynes and Carnap.

The causal theory of chance, unlike its competitors, satisfies all of the following criteria:

(i) Chance is defined for single events.
(ii) Chance supervenes on the physical properties of the system in question.
(iii) Chance is a probability function, i.e. a normalised measure.
(iv) Knowledge of the chance of an event warrants a numerically equal degree of belief, i.e. Miller's Principle can be derived within the theory.
(v) Chance is empirically accessible, within any given range of error, by measuring relative frequencies.
(vi) With an additional assumption, the theory entails Reichenbach's Common Cause Principle (CCP).
(vii) The theory enables us to make sense of probabilities in quantum mechanics.

The assumption used to prove the CCP is that the state of a system represents complete information, so that the state of a composite system "factorises" into a logical conjunction of states for the sub-systems. To make sense of quantum mechanics, particularly the EPR experiment, we drop this assumption. In this case, the EPR criterion of reality is false. It states that if an event E is predictable, and locally caused, then it is locally predictable. This fails when maximal information about a pair of systems does not factorise, leading to a non-locality of knowledge.

TABLE OF CONTENTS

Abstract
Table of Contents
Acknowledgement

1. Causation and Determination
  1.1 Some Historical Background
    1.1.1 Aristotle
    1.1.2 The Scholastics
    1.1.3 David Hume
    1.1.4 David Lewis
  1.2 Causation
    1.2.1 Causation and Signalling
    1.2.2 Causation and Time
  1.3 Logical Relations
    1.3.1 Determination
    1.3.2 The Generalised Lagrangian
    1.3.3 Determinism
  1.4 The Principal Problem for Chance
  1.5 The Causal Theory of Chance
    1.5.1 Type Causation
    1.5.2 Causes as Chance-Raisers
    1.5.3 Signals Again
    1.5.4 The Principal Problem Solved
    1.5.5 Explanation
    1.5.6 Overview of the Thesis
2. Logic and Probability
  2.1 The Objections to Logical Probability
  2.2 The Nature of Logical Probability
    2.2.1 Bedeutung and Sinn
    2.2.2 Entailment and Probability
    2.2.3 Truth and Authority
    2.2.4 Relative States of Affairs
  2.3 Measuring Degrees of Belief
  2.4 The Axioms of Probability
  2.5 Relative Probabilities
  2.6 Interval Probabilities
  2.7 The Symmetry Axiom
3. Physical Chance
  3.1 The Definition of Chance
  3.2 Chance is Relative to a System
  3.3 Lewis's Objections
  3.4 A Proof of Miller's Principle
  3.5 The Objections of Howson and Urbach
  3.6 Chance and Relative Frequency
  3.7 Frequency Theories of Probability
    3.7.1 Von Mises
    3.7.2 Popper
    3.7.3 Howson and Urbach
    3.7.4 Van Fraassen
    3.7.5 Conclusion
  3.8 Conditional Chances
4. Classical Stochastic Mechanics
  4.1 What is CSM Good For?
  4.2 The Law Function
  4.3 Relevance and Correlation
  4.4 Chance in a Composite System
  4.5 Sub-Histories, States and Markov Processes
  4.6 Boundary Conditions and Time
  4.7 The Arrow of Time
    4.7.1 Time's Arrow and Causation
    4.7.2 Temporal Symmetries of the Lagrangian
    4.7.3 Time's Arrow and Chance
5. Correlation
  5.1 Classical and Quantum Correlation
    5.1.1 Classical Chance Correlations
    5.1.2 The Rules of Quantum Mechanics
    5.1.3 The Orthodox Interpretation
    5.1.4 The EPR Argument
    5.1.5 Predictive and Causal Locality
    5.1.6 Bell's Theorem
  5.2 Reactions to the EPR Argument
    5.2.1 Bohr
    5.2.2 Bohm
    5.2.3 Everett
    5.2.4 Other Approaches
  5.3 Beyond Postulate CSM3
    5.3.1 Factorisability and Consistent Families
    5.3.2 Factorisability and Predictive Locality
    5.3.3 Predictive Locality and Local Realism
6. The State Vector
  6.1 The Problem
    6.1.1 The Physical Interpretation
    6.1.2 The Epistemological Interpretation
    6.1.3 The AND/OR Problem
    6.1.4 The Shifty Split
    6.1.5 The Measurement Problem
  6.2 Large and Small Systems
    6.2.1 Emergent Properties
    6.2.2 Saturated Models for Large Systems
    6.2.3 Relative Models for Small Systems
    6.2.4 Measuring Instruments
    6.2.5 States for Small Systems
  6.3 Chance for Small Systems
    6.3.1 The Nonlocality of Chance
    6.3.2 Coordinates for Chance
    6.3.3 The Nature of Coordinates
    6.3.4 The Nature of the State Vector
    6.3.5 Time Evolution
  6.4 Summary
7. Conclusion
Bibliography

ACKNOWLEDGEMENT

I would primarily like to thank Andrew Irvine for his constant support, wise advice, and general determination to get me through. It is hard to see how he could have done a better job. Paul Bartha made many important contributions to this thesis, checking the technical details with great care, finding mistakes and making positive suggestions. The knowledge that he would be reading my work was a great motivation to get it right. Philip Stamp also had a profound impact on the thesis, particularly in his devastating attack on the first draft. This caused me to re-write two entire chapters from scratch, and greatly modify a third, thus strengthening the thesis quite considerably.
His many (146) criticisms of the second draft also led to many substantial corrections, improvements and clarifications. It was David Papineau, my supervisor at King's, who first suggested that the foundations of probability might be up my alley, as "it is a mess". I want to thank my parents for being supportive of my move to Vancouver, to finish my program. I want to thank my house-mates Robert, Kevin, Kenn, Natalie, Paul, Darlene, Marie, Diane, Sally, Todd, Tessa and Lisa, who put up with my trying out certain ideas on them. Discussions with Robert and Kenn were particularly fruitful. I want to thank Gord, Ute and all at GCF for their prayers and support, particularly those who came to my defense: Christina, Kate, Richard, Jonathan, Mike, Katharine, and Glenna. I want to thank my friends at IPC, particularly Philip and Virginia Mason and Donald Drew, for their prayers and encouragement over many years.

1. Causation and Determination

This is a thesis about physical probability, or chance. The term chance has had a number of different meanings, but is now regularly used to refer to the kind of probability that is inherent in some physical systems. The half-life of a nucleus, for instance, is defined as the length of time in which the probability of decay is 1/2. The use of the term 'probability' here has nothing to do with any weight of evidence or degree of subjective certainty, as is the case in some other contexts, but appears to denote something objective and physical. Thus, when we say that in the period of its half-life the chance of a nucleus decaying is 1/2, we mean this in an objective, physical sense.

It will be granted, I think, that the chance of an event is a degree of something or other, so that a chance of one represents the possession of that thing in its fullest measure, and a chance of zero represents its complete absence.
An event has chance one if it has some property F, and occurs with a lesser chance if it has F partially, or to some degree. Looking at chance this way, the principal question to be answered is: What is the property F? In this chapter we shall consider the two main types of answer that have been given to this question. The first is that F has something to do with causation, and the second is that F has to do with determination. In other words, if an event happened by chance this might mean that it was uncaused, or spontaneous. Alternatively, it could mean that the event was undetermined, or under-determined. My own theory of chance, which is developed and defended in this thesis, involves both causation and determination. The chance of an event, I claim, is the degree to which it is determined, or necessitated, by its causes. I call this the causal theory of chance. In this chapter I give a short survey of several historically important ideas about chance and causation, including those of Aristotle and Hume. I then outline my own view of chance, explaining its relations to previous ideas, and to other philosophical issues. Finally I will give an overview of the rest of the thesis.

1.1 Some Historical Background

1.1.1 Aristotle

In Aristotle's Physics,1 and indeed among ancient thinkers generally, a cause is what generates, produces, or brings about, some object, event or change. Reference to causes is central in explaining why objects exist, and why events occur. For instance, the father is a cause of the child, and generally what makes is a cause of what is made. As Aristotle states when he begins his enquiry into causes, "Knowledge is the object of our inquiry, and men do not think they know a thing till they have grasped the 'why' of it (which is to grasp its primary cause)" (194b, 19-21).

1 Book II, available in Barnes (1984).

Aristotle developed this simple idea by distinguishing four different kinds of answer to the "why" question, i.e. the question as to why some object exists or some event occurs. First, there is the material of which the object is composed, or in which the change is wrought. If one asks, for instance, why this statue exists, one answer is that the bronze in the statue exists. This is not a complete explanation, of course, but the statue could not exist without the bronze. The bronze is the material cause of the statue. It is important here to lay aside the idea that causes temporally precede their effects. The idea is that the existence of the statue is continually supported by the existence of the bronze.

Second, there is the form or the archetype of the statue. We must recall here the Platonic notion that the form (of square, say) is the characteristic in virtue of which an object is square. It is through the form's being present in the physical object, in some way, that the object is square. Thus the forms that are present in a statue support its existence as a statue, making it the thing that it is. This cause is referred to as the formal cause.

These first two kinds of cause, although they do answer the "why" question to some extent, are not what immediately springs to mind when we wish to explain the existence of a statue. We would normally invoke the sculptor who actually manufactured it. We are not usually concerned with the metaphysical constituents which maintain an object in existence, but rather with the agent (whether personal or otherwise) which brought about some change. If there are muddy footprints on the new rug, and I want to know why they are there, I am not usually interested in what they are made of or what forms they instantiate. Rather, I want to know who or what put them there. This object (i.e. the dog, the room-mate, or whatever), which Aristotle calls the "primary source of the change" (194b, 30), is known as the efficient cause of the footprints.
Efficient causes are often persons, or animals, but in many cases are inanimate objects. For instance, the efficient cause of a crater is often a meteor, and the sun and moon are the efficient causes of the tides. In modern parlance, the term 'cause' almost always means the efficient cause.

The fourth kind of cause in Aristotle's classification is the purpose of the object to be explained. If one wonders why there is a metal rod at the top of a tall building, for example, one might be told that it is to protect the structure from electrical storms. In this case one is referred to the purpose, or goal, of the object. Since this seems to be a legitimate answer to the "why" question, Aristotle counts it as another kind of cause - the final cause.

Final causes have been out of favour in science, over the past couple of centuries, but they seem to be making a comeback.2 It is becoming recognised that to explain something by reference to a final cause does not exclude another explanation in terms of efficient causes. Having said that the metal pole is a lightning conductor, one can also say that it was erected by so-and-so. Moreover, the antipathy to final causes seems partially due to a misunderstanding of what they are. They are not efficient causes, operating mysteriously from the future, somehow pulling an object into being. Indeed, it seems that a final cause cannot "operate" except through efficient causes. If someone erects a lightning rod, this is surely caused by a desire to protect the building, and this desire is an efficient cause.

2 See, for instance, Elliott Sober (1993).

In this thesis I use the word 'cause' in its modern sense, which corresponds roughly at least to Aristotle's efficient cause. One difference between ancient and modern discussions of causation is the tendency for moderns to regard causation as a relation between events, rather than objects. This does not seem to be a very substantial disagreement, but more a matter of terminology.
One can see the dog as being the cause of the footprints on the rug, or alternatively one can consider the dog's walking on the rug to be the cause of the rug's having muddy footprints on it. These two perspectives appear to be quite compatible. In general I shall follow the modern approach and view causation as a relation between events, but where it is convenient I will use the (equally valid) older idiom.

In Aristotle, chance is defined in terms of causation rather than necessity. He mentions the opinion of those who "say that nothing happens by chance, but that everything which we ascribe to chance or spontaneity has some definite cause" (196a 1-2). It is thus implied, though not explicitly stated, that an event which happens by chance has no (efficient) cause. Aristotle is unhappy with this account of chance since, even if we agree that all events are caused, we still require a distinction between those events which occur by chance and those which do not. Aristotle applies the concept of chance only to events which are caused by intelligent agents, in those cases where there is a mismatch between the agent's intended outcome and the actual outcome. For instance, suppose a man goes to the market to buy eggs, and while he is there runs into a friend of his. He met his friend by chance, we would say, because that was not his purpose in going to the market. This anthropocentric notion of chance, while no doubt useful, seems unconnected to the notion explored in this thesis.

1.1.2 The Scholastics

In general the scholastics followed Aristotle in their ideas on causation, although one innovation is worthy of note. They distinguished between a causa fiendi and a causa cognoscendi. I do not want to endorse this distinction, as I think it contains a fundamental mistake, but it is a mistake to be learned from. What are these two causes? They correspond to the two meanings of the English word 'because'.
Consider the sentence "Grandfather is ill because he ate lobster yesterday".3 Here the lobster (or the event of eating it) is proposed as an efficient cause of Grandfather's illness. On the other hand, if we say "Grandfather is ill, because he is still in bed", then clearly the word 'because' does not mean efficient cause. Grandfather's being in bed is not an efficient cause of his illness, but rather it is seen as conclusive evidence for an illness. We are dealing with a logical relation between propositions rather than a physical relation between events. A causa fiendi is an efficient cause, but a causa cognoscendi is rather a logically sufficient ground. In the sense of causa cognoscendi, therefore, we might say that the premise of a valid argument causes the conclusion, or perhaps that the truth of the premise causes the conclusion to be true.

Of course this scholastic distinction is valid, in that the two relations are certainly distinct, but in my view they are so distinct that the use of the term 'cause' for a logically-sufficient ground is very unwise. As we noted above, the fundamental idea of a cause is that which produces, or brings about, an object or event. Now, a causa cognoscendi does not produce, or bring about, the conclusion it entails. Rather, it supplies knowledge which enables one to infer the conclusion. Indeed, in the example above, where Grandfather's being in bed is the causa cognoscendi of his being ill, the actual causal relation is the very reverse of this. Grandfather's being ill causes him to remain in bed.

The moral of this discussion is that we must be very careful to distinguish between logical and causal relations. This may seem an elementary point, but I believe that these two are systematically confused even at the present time. The distinction between causation and determination, for instance, which is still controversial, is of this very kind.
One of the fundamental conclusions of this thesis is that determination, unlike causation, is a logical relation, a matter of inference.

3 This example is taken from C. S. Lewis (1947:19).

1.1.3 David Hume

Hume4 regarded causation and necessitation (determination) as the same relation. An event of type C causes, necessitates or determines an event of type E just in case C-events are invariably followed by E-events. This is usually known as a regularity theory of causation, although in the terminology of work on chance it would be called a frequency theory. We may express this definition in the following equivalent form: C causes E just in case, among the class of C-events, the relative frequency of those which are immediately followed by E-events is 1. Now, here is a relation between C and E which admits of degrees, as this relative frequency could (conceivably at least) take values which are less than one. Here then is what might be called (somewhat misleadingly) Hume's theory of chance. The chance of E, when C occurs, is the degree to which C causes E. This is just the relative frequency, among the class of C-events, of those which are immediately followed by E-events.

4 The relevant works are Hume (1739), (1748).

To attribute this theory to Hume is misleading since Hume held that there is "no such thing as Chance in the world" (1748: §VI). Still, however, he must have meant something by "chance", even to deny its existence, and there is evidence that his meaning was that of the theory above. Hume notes that some causes are more "regular", or "certain", than others. For instance, while "fire has always burned, and water suffocated every human creature", "rhubarb [has not] always proved a purge, [nor] opium a soporific to every one, who has taken these medicines" (1748: §VI). There are two possible responses here. Hume takes the view that opium is just part of the cause of sedation, so that there are additional "secret" causes which may not be apparent to us. For instance, opium may only have its soporific effect when acting upon a certain type of constitution. Thus, according to Hume, if C is the total cause of E, then C-events are invariably followed by E-events. The other possible response is to assert that, no matter how much additional information we have about the cause C, there will always be some cases where C is not followed by E.
For instance, opium may only have its soporific effect when acting upon a certain type of constitution. Thus, according to Hume, if C is the total cause of E, then C-events are invariably followed by is-events. The other possible response is to assert that, no matter how much additional information we have about the cause C, there will always be some cases where C is not followed by E. 4 The relevant works are Hume (1739), (1748). 6 When Hume asserts that there is no such thing as chance in the world, therefore, he surely means that a total cause is invariably followed by its effect; any apparent exceptions to this rule are due to the operation of "secret causes". Putting it another way, all true chances are either zero or one. Interestingly, Hume does acknowledge the existence of what he calls probability, which is some sort of pseudo-chance. In the case where C-events are invariably followed by E-events, Hume points out that, on seeing an instance of C, we infer that E will follow. This inference may or not be justified, but it does in fact occur. In a similar way, Hume claims, if E follows C (let us say) 60% of the time, then upon seeing an instance of C we form a partial belief that E will follow, a belief with degree 60%. This degree of belief is what Hume calls "probability" (1748: §VI). We shall see the causal theory of chance is actually somewhat similar to Hume's, even though there are deep points of divergence between us. The most fundamental problem with Hume's account, in my view, is that his analysis of causation does not even faintly resemble our intuitive idea of this relation. As Aristotle says, C causes E just in case C makes E happen, i.e. C brings E about. Causation is a matter of ontological dependence, which is not the same as constant conjunction. When we observe a constant conjunction of two types of event we wonder if there a causal link between them, and of what kind. This is not merely to ask whether the conjunction will continue to occur. 
It is interesting to see how Hume approaches his analysis of causation - a notion which, he claims, is "more obscure and uncertain" than any other. He takes empiricism, the claim that all our ideas, including the idea of causation, derive from experience, as a premise: "all our ideas are nothing but copies of our [sense] impressions" (1748: §VII Part I). To understand the idea of causation, therefore, we must find the impression from which it is derived. Now, we clearly cannot directly observe a relation such as ontological dependence; one cannot form a sense impression of it. According to empiricism, therefore, our notion of causation cannot be the Aristotelian one - it is ruled out as a candidate from the very start.

What therefore is the original of our idea of causation, in Hume's view? He notes two things: (i) If C causes E, then C-events are invariably conjoined with E-events; (ii) Having experienced such a constant conjunction over a large number of instances, we immediately infer that an E will occur, whenever we observe a C. Hume then claims that the idea of cause and effect arises from a habit of the mind, which results from observing that one type of event is invariably followed by another. In other words, the idea of causal power comes from acquaintance with our own belief dynamics.

...the mind is carried by habit, upon the appearance of one event, to expect its usual attendant, and to believe, that it will exist. This connexion, therefore, which we feel in the mind, this customary transition of the imagination from one object to its usual attendant, is the sentiment or impression, from which we form the idea of power or necessary connexion (§VII, Part II).

It is hard to reconcile this account of the origin of the idea of causation with Hume's regularity analysis of causation itself.
One would think that, just as our idea of a robin is a copy of impressions of robins, our idea of heat is a copy of impressions of hot objects, and so on, our idea of causation would be a copy of an impression of causation. In that case, our idea of causation would be a copy of sense impressions of constant conjunction, not human belief dynamics. Perhaps Hume is suggesting that our idea of causation contains a fundamental error, a projection onto the world of something that is really in the mind?

This puzzle is deepened when Hume offers a second analysis of causation itself, in terms of human belief dynamics. He defines a cause as "... an object followed by another, and whose appearance always conveys the thought to that other" (§VII, Part II). This is obviously inconsistent with the first definition, so that it is hard to see what he may have in mind. Immediately after giving the second definition, he states that:

But though both these definitions be drawn from circumstances foreign to the cause, we cannot remedy this inconvenience, or attain any more perfect definition, which may point out that circumstance in the cause, which gives it a connexion with its effect. We have no idea of this connexion; nor even any distinct notion of what it is we desire to know, when we endeavour at a conception of it.

Again, these remarks are difficult to interpret with certainty, but Hume seems to admit that the two notions, of (i) constant conjunction and (ii) habitual inference, are different from causation itself. Neither, it seems, touches the basic issue of what connects a cause with its effect. Nonetheless, he remains firm in his view that we have no idea of this connection, beyond our ideas of constant conjunction and habitual inference.

Contrary to what Hume repeatedly says, it is quite clear that there is a common notion of a cause as that which produces an effect, or brings it into existence.
This is shown by the fact that it makes perfect sense to say: I grant you that lightning is always followed by thunder, and that we all habitually expect thunder after lightning, but is it really the lightning that produces the thunder? Hume is right that our idea of causal power is not perfectly clear, but it is easily sharp enough for us to distinguish it from both constant conjunction and habitual inference. Since empiricism forces us to deny what is so plainly true, we should simply reject empiricism. This allows us to retain the Aristotelian conception of causation. This general notion of causation does not provide us with a priori knowledge of particular causal connections, of course, just as one can grasp the general concept of molecular structure without knowing the structure of any particular compound.

As an account of necessity, or determination, Hume's theory is somewhat more successful. I show in §1.3.1 that necessary connection is a logical (inferential) relation, so that Hume's account is at least along the right lines. Indeed, the main problem with Hume's reasoning is his conflation of power or force with necessary connection, i.e. of causation with determination.

1.1.4 David Lewis

David Lewis (1973:160) agrees with two important claims made above, that regularity theories of causation are false, and that causes need not determine their effects. He also thinks there are chancy events, although he takes chance as primitive and offers no analysis. Lewis does offer an analysis of causation, however, using chance. Suppose C and E both occur, and the chance of E was x. Also, if C had not occurred, then the chance of E would have been y. Then C is a cause of E just in case x > y. In other words, a cause increases the chances of its effects. In the special case of a deterministic system this means (roughly) that C causes E just in case E depends counterfactually upon C; i.e. C and E both occur and, if C had not occurred, E would not have occurred either.5

5 This definition has to be modified slightly to make the relation transitive, but this detail does not concern us.

This counterfactual analysis of causation is rather more plausible than Hume's regularity account. For one thing, instead of abandoning the common-sense notion of cause as meaningless, it attempts to clarify and sharpen that notion. I shall argue, however, that while there is a close link between causation and counterfactual dependence, this link alone does not permit a successful analysis of causation.

To examine the relation between causation and counterfactual dependence it is helpful to look at a simple example of causation, where we understand fairly clearly what is going on. The best such example is where one object supports another, i.e. prevents it from falling under gravity. The supporting object causes the supported object to remain in place. A support C helps to prevent an object E from falling, under the force of gravity. It does so by exerting an upward force on E which, to a greater or lesser extent, balances the gravitational force. Thus, we shall say that C is a (partial) support of E just in case C exerts some upward force on E, however small it may be. Moreover, C is a total support of E if it is, by itself, sufficient to prevent E from falling. The force exerted on E by a total support is equal and opposite to the weight of E.6
The two legs nearest to the one removed would bear all of the weight, roughly speaking, while the one diagonally opposite would be almost redundant (depending on the exact geometry, which has not been specified). For a case where removal of a partial support does cause E to fall, consider a triangular table top supported by a leg at each vertex. Here, removal of any leg causes the top to fall. Now let us consider total supports. Does removal of E's total support cause E to fall? The answer is yes, in all normal cases, although one can contrive situations where this does not hold. I once heard of an architect who designed some kind of roof which was supported in an unorthodox way. It had no pillars where, at the time, it was customary to have pillars, being perhaps cantilevered or something. The builder, or perhaps the owner, was unhappy with the design, fearing that the roof would collapse, and insisted that pillars be added. To this the architect eventually consented, although to prove his point they were built so that there was a tiny gap, perhaps an inch or two, between the tops of the pillars and the roof. Thus, the pillars exerted no force at all on the roof, and so were not even partial supports of it. If the roof had slipped even slightly, however, it would have come to the rest upon the pillars and still not fallen. In such a case as this, it may be that E does not fall even if its total support is removed, as things are arranged so that a back-up support automatically comes into effect when this happens. Something which, in fact, is not a support would become a support. 6 T o keep things simple, I am neglecting consideration of moments. We should really consider the line of application of each force to ensure that E is also in moment equilibrium. 11 It is clear that Cs supporting E is wholly a matter of the force actually exerted by C on E, and is only indirectly connected with what would happen in the absence of C. 
Removal of C may reduce the upthrust on E to such an extent that it falls down, but that depends on factors which vary independently of the force exerted by C on E. Thus a counterfactual analysis of support would miss the essential point. It is no doubt possible to deal with some of these awkward examples by adding epicycles, but this is a clear sign of a degenerate analysis - is the notion of support really that complicated? A counterfactual analysis would also fail to make the important distinction between total and partial supports.

This is all very well for the case of support, but perhaps support is not a good example of a causal relation? Perhaps it is not a genuine causal relation at all? To allay these fears I shall show that the same points can be made for a more standard type of example. Consider an atom bomb, which requires a conventional explosive as a detonator. Instead of a single charge of TNT, suppose we have four separate, similar charges, which all explode together, causing the fission reaction to begin.7 The four charges together are a sufficient "match" to ignite the main bomb, but let us suppose that each one singly is not enough. What would three of them have done? Both alternatives are surely possible here; by choosing the size of the charges, we can make three charges together either sufficient or not. In other words, we can have an analogue of either the square or the triangular table.

If we suppose that three would have been enough, and consider one of the small explosions C, it seems that C is a partial cause of E, the atomic explosion. If C had not occurred, however, E would still have occurred, so that E does not depend counterfactually upon C. We see therefore that, in the case where three charges are insufficient, E depends counterfactually upon C, but not in the case where three are sufficient. Thus, according to Lewis' theory, any single charge (pick one!)
causes the main explosion in the one case but not in the other. This is highly counter-intuitive. One might as well say that a leg on a triangular table supports the top, but a leg of a square table does not!

7 We may suppose also that each pair of small charges is symmetric within the entire set-up.

Lewis does consider cases similar to this, saying that the four charges overdetermine E. This means that there is a smaller set of causes (pick any three charges) which are jointly sufficient to determine that the big explosion happens. Lewis avoids situations exactly like this, however, where there is symmetry between the causes, since (he says) "For me these are useless as test cases because I lack firm naive opinions about them" (1973:171, n.12). The problem with Lewis's intuitions seems to be the lack of a distinction between total and partial causes. The symmetry of the case prevents any single charge being identified as the (total) cause, and Lewis does not have the notion of a partial cause, so he does not know what to say.

We see then that, although there is a link between causation and counterfactual dependence, it is not sufficiently tight to allow an analysis of causation in such terms. The fundamental reason for this is that causation is a non-modal relation, so that it depends only upon what exists in the actual world, whereas counterfactual conditionals are modal.

Another problem with Lewis's modal approach concerns the relata of the cause-effect relation. Lewis wants causes and effects on his account to be singular, concrete events, rather than event-types, propositions about events, or any such abstract objects. This requires, however, that the same event can occur in different worlds; and this does not seem possible.8 At best, for some event E in the actual world, another world may contain a counterpart of E.
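The contrast between the two cases can be made vivid with a small sketch (my own illustration, not from the text): model the detonation as a threshold on the number of charges that fire, and test counterfactual dependence on a single charge by removing it.

```python
# Illustrative model of the four-charge example. E (the atomic explosion)
# occurs just in case at least `threshold` of the conventional charges fire.

def explosion_occurs(charges_fired, threshold):
    return charges_fired >= threshold

def depends_counterfactually(n_charges, threshold):
    """E depends counterfactually on one charge C just in case E occurs
    with all n charges, but would not occur with C removed."""
    return (explosion_occurs(n_charges, threshold)
            and not explosion_occurs(n_charges - 1, threshold))

# Four charges, three sufficient (the "square table" case): removing one
# charge leaves E intact, so E does not depend counterfactually on C.
print(depends_counterfactually(4, 3))  # False

# Four charges, all four needed (the "triangular table" case):
print(depends_counterfactually(4, 4))  # True
```

By symmetry every charge is on a par, yet on the counterfactual analysis each single charge is a cause in the second case and none is in the first.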
It seems that, if causation is a relation between singular, concrete events, then it must exist within the actual world. Even so, the claim that the causal relation is non-modal has also been criticised. Hume's theory also has this non-modal feature, of course, as causation is a matter of constant conjunction in the actual world. Now a frequently-criticised part of Hume's theory has been his denial of any "necessary connexion" between cause and effect, as it seems to make the constant conjunction of cause with effect merely accidental or contingent. Hume's claim seems intuitively wrong; the effect does not merely follow the cause, however regularly, but rather the effect is brought about by the cause. There is surely some kind of intimate connection between the two events which is stronger than mere contiguity in spacetime.

8 For a modal realist like Lewis, transworld identity of events is problematic. For the rest of us, possible events are abstract entities of some kind, not concrete objects.

One obvious way to strengthen the connection is to make it modal, i.e. to extend it to include counterfactual states of affairs. This can be done in various ways, of course, resulting in relations of varying strength. The strongest connection would be if the effect follows the cause in all physically-possible worlds, so that the conjunction is necessary.9 A weaker connection is achieved in Lewis's account. This modal approach springs to mind because it is the way we deal with a superficially similar problem, namely the difference between material and strict implication.10 The material conditional is rather weaker than logical entailment, as is well known. A material conditional is true, for instance, whenever its antecedent is (in fact) false. Strict implication can be defined in terms of the material conditional, however, by making it modal.
If a material conditional holds necessarily, then it holds even in those worlds where the antecedent is true - and in all such worlds the consequent must be true as well. Thus necessary material implication is the same as strict implication. Bearing in mind the important distinction between causal and logical relations, we should be wary of the use of modality to understand the connection between cause and effect. Instead, following the example of support, I suggest that we imagine a sort of "ontological force" which causes exert on their effects. Indeed, I regard the relation of support, discussed above, to be a metaphor for causation of all kinds. A cause supports its effect in existence; it gives the effect being. In short, I view causation as ontological support.

9 This necessity would be merely nomic, rather than logical, necessity.

10 Lewis's account, of course, uses counterfactual conditionals, which are another kind of modal conditional.

1.2 Causation

It is not the purpose of this thesis to offer a complete account of causation, but the analysis of chance given in Chapter 3 does rest upon certain assumptions about this relation. It may have been gathered from the discussions so far that my general approach to causation is Aristotelian, as I hold for instance that causation is a primitive relation that cannot be analysed into more basic elements. The nature of this relation seems to vary considerably from case to case. When a foot causes a footprint in soft sand it is physically pushing the grains of sand, moving them to new positions. When a woman and man cause a child, the relation is more complex, but it again concerns what occurs in the actual world. In this section I will try to shed some light upon the causal relation by explaining its connections to related topics.

1.2.1 Causation and signalling

Consider a crude method of signalling between two fixed sites, connected by some link such as a wire.
For simplicity, we shall suppose that one site is the transmitter and the other is the receiver, so that the signalling is always in the same direction. The basic principle is that the person at the transmitter manipulates the machinery at that end, which causes other events to occur at the receiver. A person at the receiver who observes these events can infer the message from what he observes. A simple example of such a set-up is if the transmitter has a switch which can be turned to 'on' or 'off', and a lamp on the receiver is either lit or not according to the state of the switch. A message could then be sent using Morse code, for example.

It seems intuitively clear that communication of this kind requires a causal link between the transmitter and the receiver. If twiddling the knobs on the transmitter had no effect on the receiver, then observing the receiver would convey no information about which knobs had been turned on the transmitter. In such an arrangement we would say that information can pass from the transmitter to the receiver, but what does this mean exactly? Let the transmitter be X and the receiver Y. X can be set manually to either one of two states, 0 and 1, and Y is free of human influence. What does it mean to say that information can be passed from the transmitter X to the receiver Y?

Suppose we observe Y, and its state is 1. From this we must infer something about the state of X. For this inference to occur, a necessary condition is that P_K(X=1|Y=1) is different from P_K(X=1), where P_K represents the receiver-operator's epistemic probability. In other words, X and Y must be correlated in epistemic probability, given the knowledge of the receiver. Thus, for information transmission to occur, the receiver must have some relevant knowledge about this set-up. What kind of knowledge is this? The obvious answer is that he has some knowledge about physical chance.
More exactly, he knows the physical chance11 of each state of Y for each possible state of X, such as the chance that Y moves to 1 when X is set to 1. Letting P represent physical chance, this knowledge will allow communication just in case P(Y=1) and P(Y=0) vary according to the state of X. The inference can then proceed as follows, supposing that P(Y=1) = p when X=0, and equals q when X=1. Using Miller's Principle12, that knowledge of the chance of an event authorises a numerically equal degree of belief, we have P_K(Y=1|X=0) = p, and P_K(Y=1|X=1) = q. Then, using Bayes's theorem, we have:

P_K(X=1|Y=1) = P_K(Y=1|X=1).P_K(X=1) / [P_K(Y=1|X=1).P_K(X=1) + P_K(Y=1|X=0).P_K(X=0)]
             = q.P_K(X=1) / [q.P_K(X=1) + p.P_K(X=0)].

Thus, provided P_K(X=0) differs from both 0 and 1, observing Y=1 alters the receiver's epistemic probabilities for X. We see therefore that there is an important link between causation and chance, as Lewis, Suppes etc. believe.13 Communication implies a causal link. It also implies a correlation in the chance function. Lewis's definition of a cause as a chance raiser satisfactorily explains why a causal link is necessary for communication, and we shall see that the causal theory of chance can account for this as well.

11 In fact he does not need to know for sure what the chances are; it is enough to have some idea of what they are.

12 Miller's Principle is discussed in detail in §3.2.

13 See Lewis (1986:175-184), Suppes (1970), Cartwright (1979) and Mellor (1986).

1.2.2 Causation and Time

One feature of the causal relation which is almost never denied is that causes temporally precede their effects. Is this true? If so, is it something which requires explanation, or must it be accepted as a brute fact? Is it a genuine fact at all? Perhaps it is just a linguistic convention. It is not possible to outline and argue for my view fully here, although the matter is discussed in detail in §4.7.
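Returning briefly to the signalling example of §1.2.1, the Bayesian inference given there can be checked numerically. This is an illustrative sketch; the chance values p, q and the prior are invented for the purpose.

```python
def posterior_x1_given_y1(p, q, prior_x1):
    """P_K(X=1 | Y=1) by Bayes's theorem, where p = P(Y=1 | X=0) and
    q = P(Y=1 | X=1) are the physical chances, adopted as degrees of
    belief via Miller's Principle."""
    prior_x0 = 1.0 - prior_x1
    return q * prior_x1 / (q * prior_x1 + p * prior_x0)

# A fairly reliable channel: observing Y=1 raises the probability of X=1.
print(round(posterior_x1_given_y1(p=0.1, q=0.9, prior_x1=0.5), 2))  # 0.9

# A useless channel (p = q): the posterior equals the prior, so observing
# Y conveys no information about X, as the argument requires.
print(round(posterior_x1_given_y1(p=0.5, q=0.5, prior_x1=0.3), 2))  # 0.3
```

Note that when the prior P_K(X=0) is 0 or 1, or when p = q, the observation leaves the epistemic probabilities unchanged, matching the condition stated in the text.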
My position may be summarised as follows, however. The actual, concrete history of a system may be partitioned into disjoint "time slices", each of finite duration.14 A time slice is a "chunk" of history that falls within a bounded interval of time [t1, t2), and so is a concrete event. It is a genuine fact, amenable to explanation, that each time slice is a total cause of one of its immediate neighbours, so that the causal relation defines a linear ordering on the set of slices. Using the metaphor of support, we can picture the time slices as a vertical stack of books. Each book in the stack (apart from the bottom one) is supported by those below, and supports those above. The book at the very bottom represents the earliest time slice,15 which gives being to all the rest.

Although there is this close relation between causation and time in the cosmos, I view causal relations as essentially timeless. I consider it intelligible, for instance, that the cosmos itself have a cause, even though (since time is part of the cosmos) the cause and the effect could not be temporally related. Indeed, since I hold that time is just a physical dimension, like the spatial dimensions, even the causal relations between time slices are timeless. Each supports, timelessly, its successor in existence.

14 For simplicity, I am making the Newtonian assumption that spacetime is uniquely decomposable into space and time. This assumption is not necessary for the view itself.

15 This view of time does not require there to be a final time slice, but there has to be an initial one.

It is in virtue of the causal structure of the cosmos that the past is "fixed", while the future is "open". Since the past is causally upwind of the present, actions in the present have no causal influence on the past. The past is ontologically deeper than the present.
Present events are instrumental in bringing about future events, however, so the future is open to manipulation.16 The stack of books metaphor suggests an obvious question: What supports the book right at the bottom? This question takes us into difficult territory, and is well beyond the scope of the thesis. It does seem to me, however, that the causal structure of the cosmos reveals that it is not causally self-sufficient. It is not a causa sui, but depends for its existence upon something else.

16 I can make no sense of the idea that there is something called the present that "moves". I take the "block universe" view that terms like "now", "present", and so on are token-reflexive indexicals, like "here".

1.3 Logical Relations

1.3.1 Determination

Lewis's definition of determination is as follows (1979:37): C determines E just in case for every dynamically-possible world in which C occurs, E also occurs. (A dynamically-possible world is one which satisfies all physical laws.) This definition, modulo some irrelevant differences in the wording, is now fairly standard.17 Let us suppose that the proposition ℒ describes the laws of physics for the system in question. We now have that C determines E just in case every logically possible world which satisfies ℒ and C also satisfies E, i.e. □(ℒ & C → E). Then, using the fact that A entails B, i.e. A ⇒ B, just in case □(A → B), it follows that C determines E if and only if C entails E relative to ℒ, i.e. ℒ ⇒ (C → E). Determination is therefore a logical relation, being in fact the relation of entailment relative to ℒ. Since it is a logical relation, its relata have to be logical entities, and cannot be concrete objects. It is meaningless to assert that one physical object necessitates, or entails, another.

17 I believe it was first given by Montague (1962).
Thus, since we are using "C" and "E" for physical objects that are the relata of the causal relation, we need to define some corresponding logical entities for the determination relation, which we shall call m(C) and m(E). What are these entities, however? As a first approximation, we might say that m(C) is the state of affairs that C occurs, or exists. This is promising, since possible states of affairs are logical entities. One possible state of affairs can entail another, as for instance Vancouver's being very wet entails Vancouver's being wet. Also, states of affairs can be negated, conjoined and disjoined. This definition of m(C) will not do as it stands, however, since for any concrete event C there are many different states of affairs that C occurs. Consider, for instance, some particular physical event C of Smith stubbing a toe. The states of affairs of Smith stubbing one of his toes and of Smith stubbing a toe on his left foot both concern the event C. Both of them are actual states of affairs. Which, if either, is m(C)?

The problem here is that some states of affairs concerning C are more detailed than others; we might say that some "contain more information" than others. The obvious way around this problem is to define m(C) as the maximal actual state of affairs concerning the physical event C. In what sense is m(C) maximal, i.e. what ordering relation is being used? The ordering relation is just logical entailment or necessitation, so that m(C) entails all the other actual states of affairs concerning C.18 One may worry that there is, perhaps, no maximal state of affairs concerning C, but instead an infinite sequence of such states of affairs, each of which entails its predecessors. In this case, however, the terms of the sequence are pairwise consistent, so that they have a conjunction. This conjunction entails each member of the sequence. We then have that C determines E just in case, relative to ℒ, m(C) ⇒ m(E).
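The determination relation just defined can be pictured in a toy model (entirely my own construction, for illustration): let worlds be truth-value assignments to a handful of atomic states of affairs, and say that C determines E just in case every world satisfying both the lagrangian and m(C) also satisfies m(E).

```python
from itertools import product

# Worlds: all assignments of truth values to three atomic states of affairs.
atoms = ["mC", "mE", "other"]
worlds = [dict(zip(atoms, vals)) for vals in product([True, False], repeat=3)]

def determines(lagrangian, antecedent, consequent):
    """Entailment relative to the lagrangian: every world satisfying
    both the lagrangian and the antecedent satisfies the consequent."""
    return all(consequent(w) for w in worlds
               if lagrangian(w) and antecedent(w))

# A toy "dynamical nature": no world has m(C) without m(E).
L = lambda w: (not w["mC"]) or w["mE"]

print(determines(L, lambda w: w["mC"], lambda w: w["mE"]))  # True
# Relative to a tautologous lagrangian, m(C) alone does not entail m(E):
print(determines(lambda w: True, lambda w: w["mC"], lambda w: w["mE"]))  # False
```

The second call shows that the modal force comes from the lagrangian: strip it away and the entailment fails.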
According to the ordinary notion of causation as ontological dependence, causation is a relation between concrete entities, such as events. Consider a crater which is caused by a meteor impact. It is the crater itself, the concrete event, which is brought about by the impact, another physical event. It is not the states of affairs concerning these events which stand in cause-effect relations. Here then is the first reason why causation and determination must be distinct: One is a physical relation between concrete things (primarily), and the other is a logical relation between possible states of affairs (primarily). One has to say "primarily", as the causal relation could be extended to possible states of affairs, and the determination relation could be extended to concrete things, in the obvious way.

18 A "state of affairs concerning C" is one that only concerns C, i.e. it does not give information about anything else.

We must be wary of the word "sufficient" in relation to causes. Suppose C is a total cause of E, so that C alone "exerts ontological force" on E. Then, since E does in fact exist, it seems reasonable to say that C is sufficient for E. After all C, by itself, was enough to bring E about. This is purely a fact about what did happen, in the actual world. It should not be confused with the claim that C is sufficient to determine E, so that m(C) entails m(E) relative to ℒ, which is a modal claim. The latter says that any world that includes m(C) also includes m(E). We must distinguish therefore between causal and determinative sufficiency.

1.3.2 The Generalised Lagrangian

In the previous section I introduced the symbol ℒ, calling it a description of the laws of physics for the system in question. In this section I shall be a little more precise about what ℒ is, exactly. I call ℒ_X the generalised lagrangian for the system X.
The concept of a generalised lagrangian is a somewhat Aristotelian one, as we may say that it represents the dynamical nature of the system. In the Physics19, Aristotle says that every thing that exists by nature has "...a principle of motion and of stationariness (in respect of place, or of growth and decrease, or by way of alteration)". An alteration of a system is according to nature if it proceeds from the system's innate principle of motion. If a motion has an external cause, being produced by a force acting on the system from outside, then the motion is compulsory or unnatural. The downward gravitation of a dense body, for example, was considered to be a natural motion, a change produced by the body's internal impulses, and not the result of any external force.

19 This material is found in Book II, Chapter 1 (192b - 193b), and is available in Barnes (1984).

By the end of the seventeenth century, it seemed that this teleological approach to mechanics was infertile and bankrupt. In the light of Newton's achievements, the only change that could be considered natural was the motion of a particle in a straight line, at a constant speed. Since this rarely (if ever) occurs, the idea of a natural motion was of little value. Even gravitation was considered to be the result of an external force. The forces that produce motion were considered to be governed by laws, such as Newton's law of gravitation, and the law of action and reaction. Moreover, the relations between forces and motions were also governed by laws, such as the second law of motion. In general we may say that the idea of the dynamical nature (as a regulator of motion) was replaced by the concept of a law. The idea that motions are regulated by laws persists in the philosophical community even to the present day.

This situation changed when Joseph Lagrange (1788) developed a system of mechanics, for deterministic systems, that is rather more Aristotelian.
It will not be necessary to develop Lagrange's theory here, as the details are widely available,20 but the rough idea is to analyse motions not in terms of forces, but rather in terms of exchanges between different forms of energy. A particle in free fall, for example, is steadily converting its gravitational potential energy into kinetic energy, in such a way that the overall difference between these two is minimised. Indeed, the backbone of Lagrangian mechanics is now considered to be the teleological principle that the actual trajectory of a system is always the one that makes the action integral take a stationary value. (This is now known as Hamilton's Principle.) The action integral is simply

S = ∫_{t1}^{t2} L(x(t), v(t)) dt,

where L(x,v) is the difference between the kinetic and potential energies of the system, and is known as the lagrangian of the system.

20 See, for example, Sposito (1976: Ch. 9).

The lagrangian depends, then, on the gravitational field that is the source of the potential. In this way, the gravitational field is viewed not as an external force that manipulates the particle, causing it to undergo unnatural motions, but rather as a part of the total "system". The total system acts according to its own nature, which is (roughly) to minimise the action. It turns out that any isolated system, no matter how complex, has a lagrangian function, provided that all the forces involved are conservative. This lagrangian captures all the dynamically-relevant information about the system, so that Hamilton's Principle is sufficient to determine the actual motion of the system, from its lagrangian, once a boundary condition is supplied. In short, the lagrangian of a system X, together with Hamilton's Principle, may be viewed as a representation of X's dynamical nature. It is natural for a system to behave in such a way that its action is minimised.
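Hamilton's Principle can be checked numerically in the simplest case. The sketch below (my own, with illustrative values) discretises the action for a particle falling freely under gravity and verifies that the true parabolic path is stationary: perturbing it changes the action only at second order, so doubling the perturbation roughly quadruples the change in S.

```python
import math

m, g, v0 = 1.0, 9.8, 5.0        # mass, gravity, initial upward speed
T, N = 1.0, 1000                # time span and number of steps
dt = T / N

def action(path):
    """Discretised action: S = sum over steps of (kinetic - potential) * dt."""
    S = 0.0
    for i in range(N):
        v = (path[i + 1] - path[i]) / dt
        x_mid = 0.5 * (path[i] + path[i + 1])
        S += (0.5 * m * v * v - m * g * x_mid) * dt
    return S

# The true trajectory x(t) = v0*t - g*t^2/2, sampled at the grid points.
true_path = [v0 * (i * dt) - 0.5 * g * (i * dt) ** 2 for i in range(N + 1)]

def perturbed(eps):
    # A perturbation vanishing at both endpoints (fixed boundary condition).
    return [x + eps * math.sin(math.pi * i / N)
            for i, x in enumerate(true_path)]

S0 = action(true_path)
d1 = action(perturbed(0.01)) - S0
d2 = action(perturbed(0.02)) - S0
print(round(d2 / d1, 3))  # close to 4.0: the change in S is quadratic in eps
```

Since the potential here is linear in x, the first variation vanishes exactly on the sampled parabola and the second variation is purely kinetic, which is why the ratio comes out so cleanly.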
If a system X has a dynamical nature, then we can define ℒ_X, the generalised lagrangian for X, as the maximal state of affairs concerning X's dynamical nature. The state of affairs ℒ_X can then be used to define the determination relation, as in the previous section. Determination is entailment relative to the generalised lagrangian. The generalised lagrangian is a replacement of, and an improvement upon, the concept of a physical law. It does all the same work, and more.

Are these two approaches to mechanics, using laws and using dynamical natures, really any different, however? It may seem that they are, at bottom, fully equivalent, by the following argument. In Lagrangian mechanics one typically does not make explicit use of Hamilton's Principle. Instead, to calculate the motion of the system, one uses Lagrange's equations of motion, which for a single particle are just:

(d/dt)(∂L/∂v) − ∂L/∂x = 0.

Now, since ∂L/∂v is the momentum of the particle, and ∂L/∂x is (minus) the gradient of the potential, i.e. the applied force, this equation entails Newton's second law of motion. Thus, it may seem, the introduction of a "dynamical nature" for the system is a rather superficial change. It may be an aid to calculation, but it makes no difference in theoretical terms.

This argument faces four objections. First, the Lagrangian formalism is more general than the Newtonian one. It is possible to write down a lagrangian for many systems that cannot be analysed using Newton's methods. This suggests that the concept of a dynamical nature is fruitful. Second, it should be admitted that the introduction of a dynamical nature does not show the concept of a law to be invalid. Rather, it enables us to obtain a much better understanding of what a law of nature is. Suppose some proposition A truly describes the actual history of a system X. Is A a law, or merely an accidental fact? Using the notion of a dynamical nature for X, we can say that A is a law (i.e.
nomically necessary) just in case ℒ_X ⇒ A. The point is that the idea of a dynamical nature is more fundamental and general than the idea of a law. On this view, a law is characteristic of the dynamical nature of the system. Its necessity consists of the fact that it is entailed by ℒ_X, which maximally describes that dynamical nature. The modal force of the law, in other words, derives from its logical relation to the dynamical nature. This account of laws solves the inference problem, as van Fraassen (1989: 29; 38-39) calls it, of showing that the inference of "A is true" from "A is a law" is valid. This is immediate, since given ℒ_X ⇒ A, and given that ℒ_X is true, we infer that A is true by modus ponens.

The third advantage in introducing the generalised lagrangian ℒ_X is that it offers a superior understanding of the dynamics of a stochastic system. This matter is discussed in detail in chapters 3 and 4, but the rough idea is as follows. Stochastic laws involve probability, which is a kind of modality. Now, if our fundamental account of the dynamics of a system is in terms of a stochastic law, such as a stochastic equation of motion, then the probabilities are built into the fundamental dynamical description of the system, and so their nature remains mysterious. The description ℒ_X of the dynamical nature, on the other hand, does not involve probabilities. ℒ_X talks about what is there, in the system, rather than about what will probably occur. The probabilities arise from the logical relation of partial entailment between ℒ_X and propositions about the motion of X. A stochastic system does not have a lagrangian in the usual sense, because the forces are random, and so the potential energy is not well defined. It does not follow, however, that the more general concept of a dynamical nature is not valid for stochastic systems. One of the main theoretical claims of this thesis is that every system has a generalised lagrangian.
The fourth advantage of the dynamical nature is that it can be understood as a cause of the motion of the system. Suppose a pendulum vibrates with a particular frequency. What is the cause of this frequency? It is easy to show that the frequency of a pendulum is approximately independent of both the amplitude of its motion and the mass of the bob. It depends only upon the length of the string. It seems clear, therefore, that the frequency of the vibration is brought about, or caused, by the length of the string. The length of the string is part of its intrinsic dynamical nature, represented in the generalised lagrangian, so that in general we can say that a system's dynamical nature is one of the causes of its behaviour. A law, on the other hand, does not cause anything. It is merely a statement about the motion of the system.

I say that ℒ_X describes only one of the causes of X's behaviour since, in the pendulum example, the dynamical nature is not sufficient to cause a particular amplitude of vibration. There can be two pendula which are exactly similar, so that they have the same generalised lagrangian, yet whose amplitudes of oscillation are quite different. This difference is due to the fact that one is initially given a large kick to begin its motion, and the other only a little push. Thus a second cause of a system's behaviour is the manner in which it is set going, which is usually represented by a boundary condition.

1.3.3 Determinism

Determinism has been defined as the claim that every event21 has a cause. Those advancing this definition, however, believed that causation and determination are the same relation, or at least necessarily equivalent. If the relations are distinct, then it makes sense to define "determinism" in terms of the determination relation, rather than the causal relation. So the rough idea of determinism is that every event is determined. But determined by what?
The usual answer is that, in a deterministic system, every event is determined to occur by past events, or the total past, in the system in question. Why is this, however? Why be interested in determination by past events, rather than future events (or a bit of both)? The answer is not that future events do not determine past events, for examples of this are very common. Consider the weather, for instance. It is doubtful whether November's weather is determined by the state of the world in October; at least, humans are not very good at drawing correct inferences about November's weather from information about October. By contrast, November's weather is very precisely determined by the state of the world in December, as in December there exist very precise records of November's weather, from which facts about the weather itself may be inferred with a high degree of certainty. We can "predict" temperatures, pressures, levels of rainfall, etc. with great accuracy.

The reason why one is interested in whether events are pre-determined, I believe, is that causes temporally precede their effects. We are really interested in whether an event has causes which determine that it will occur. The causes of an event are contained in its past, so if an event is determined by its causes then it is pre-determined.

21 Or every contingent event, to avoid a regress.

1.4 The Principal Problem for Chance

As stated above, the analysis of chance to be offered in this thesis is that the chance of an event is the degree to which it is determined (i.e. entailed) by its total cause. Thus, for there to be such things as genuinely chancy events, there must be events which are not (fully) determined by their causes. Now, until fairly recently, it was seen as obvious that if C is a total cause of E, then C determines E as well. This was not a careless, unfounded assumption, but is supported by an argument which I consider to be valid, though not sound.
This argument I call the Principal Problem for Chance, for if it is sound then there are no chancy events. The Principal Problem is described in one special case by van Fraassen (1989:239-240), in his consideration of Buridan's ass. Van Fraassen argues that, if we accept the claim that an asymmetry must always come from a prior asymmetry, then determinism follows. (Let us call this principle, that symmetry is preserved under causal evolution of a closed system, Conservation of Symmetry, or CS.) Here is how it works. The hungry donkey is faced with two tasty-looking bales of hay, but unfortunately the entire situation is exactly symmetric with respect to the two bales. If the ass were to move even an inch toward one of the bales, then the symmetry would be broken, which is impossible (by hypothesis). If the donkey were to choose one bale over the other, then the system would be indeterministic, since this event cannot be predicted from the initial state of the system. (If it could, that would break the symmetry.) Thus, any event which violates CS must be indeterministic.

The converse, that any indeterministic event violates CS, is not so easy to prove, however. Of course, this is the result van Fraassen actually needs, to show that CS implies determinism. The result van Fraassen requires may easily be shown if we consider a more complicated case. Two asses are better than one here, particularly if they are exactly similar to one another, and set up in the same initial state. Since they are "clones"22, let us call them Dolly and Polly. The initial states of Dolly and Polly need have no symmetry, but the similarity of Dolly with Polly means that the combined system <Dolly, Polly> has symmetry with respect to the two beasts. Now suppose that Dolly and Polly are placed in exactly similar (separate) environments.

22 Note that they are not merely clones in the usual sense, of sharing the same genotype, but are "atom-for-atom" similar.
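The clone set-up can be sketched in a toy simulation (wholly illustrative; the update rule below stands in for the shared dynamical nature): two systems with the same rule and the same initial state must have identical histories if the rule is deterministic, but may diverge once a chancy step is added.

```python
import random

def history(initial, steps, rng=None):
    """Evolve a state under a fixed rule; rng=None gives deterministic
    dynamics, otherwise each step adds an independent chancy 'kick'."""
    state, states = initial, [initial]
    for _ in range(steps):
        state = (2 * state + 1) % 101       # the shared "nature"
        if rng is not None:
            state += rng.choice([0, 1])     # a chance event
        states.append(state)
    return states

# Deterministic clones: same nature + same initial state => same history.
print(history(3, 20) == history(3, 20))  # True

# Stochastic clones: same nature, same initial state, independent chances.
dolly = history(3, 20, random.Random(1))
polly = history(3, 20, random.Random(2))
print(dolly == polly)  # almost certainly False: the symmetry can break
```

The deterministic case is the content of CS; the stochastic case shows how identical natures and identical initial states fail to entail identical histories.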
The question is whether they will engage in exactly the same behaviour. If Dolly lifts her left hind leg, will Polly do the same? Will they blink simultaneously? If Dolly and Polly are indeterministic, then their being similar (having the same lagrangian) and having the same initial state does not entail that they will have identical histories; it is possible that their histories will differ. Here is the argument. Suppose that the donkeys are stochastic, and have different histories in fact. Then, at some time, the total system <Dolly, Polly> is not symmetric with respect to the two animals. But, in its initial state, <Dolly, Polly> was symmetric in this way, so CS fails here. Thus indeterminism entails that CS is false, and so CS implies determinism, as required. Why should one think, however, that an asymmetry cannot arise spontaneously out of symmetry? The intuitive reason is that such a breakdown of symmetry could only occur if there were a hiatus in the causal nexus - the first symmetry-breaking event could not have a total cause. It would be an event "from nowhere". The event has partial causes, no doubt, such as the donkey's being hungry and so on, but there is no event which is causally sufficient for this choice. This follows immediately from the symmetry of the situation. If there is a prior event which is causally sufficient for the ass to choose bale A, then (by symmetry) there is also an event which is causally sufficient for the donkey to choose bale B. This is a contradiction, however, as the ass cannot choose both bales. (Even if it eats both, it must eat one of them first.) It is now considered fairly well established that not all systems are deterministic, and so van Fraassen concludes that CS is false, so that some events are not fully caused (1989: 240). These are "events from nowhere", in other words, ones that simply occur without anything making them occur.
It is agreed that events determined to occur by prior causes are not chancy, so that on this view chancy events are just such uncaused (i.e. not fully caused) events. The main problem with this idea of chancy events as uncaused is that it is very hard to reconcile with the phenomenon of quantum-mechanical correlation. In order to account for the fact that some events are pre-determined, but not locally pre-determined, we are forced to regard even chancy events as fully caused. This issue is discussed in Chapter 5.

1.5 The Causal Theory of Chance

According to the causal theory of chance, the chance of an event is the degree to which it is determined by its total cause. There are two causes of a system's concrete history23, represented by the generalised lagrangian and the boundary condition. If we write the boundary condition of a system X as bc_X, then the chance of an event E is the degree to which it is logically entailed by l_X & bc_X. It should be noted that this theory coincides with Lewis's analysis of probabilistic causation24, to a certain extent. Lewis's rough idea is that a cause is a chance-raiser; thus, if C caused E, then E's chance would have been less had C not occurred. We have seen the flaws in such a counterfactual analysis of causation, but it remains true that causes do usually raise the chances of their effects. This fact is easily demonstrated from the causal theory of chance, as is shown below in §1.5.2. The term "causal theory of chance" is intended to indicate this reversal of priority, by being the reverse of "the probabilistic (chancy) theory of causation". Instead of taking chance as primitive and attempting to define causation (which is unworkable) I take causation as

23 The concrete history of a system is a convenient "super-event" which includes all the events that occur within a system.

24 I am focusing on Lewis's account of probabilistic causation as it is the best of its kind, but there are many other such analyses.
primitive and attempt to define chance. Many of the connections between chance and causation are unaffected by this reversal in the order of analysis.

1.5.1 Type Causation

So far we have been considering the kind of causation that is a relation between singular, concrete events. Sometimes, however, the term is used to denote a relation between types of event, as in the statement that smoking causes heart disease. What is the meaning of such claims? First we shall suppose that an event-type is a general property of events, and thus may be considered as a proposition about an event with the particularity of that event removed. Thus an event-type is a Fregean concept25, such as "x is an explosion", "x has heart disease at time t", and so on. Now let us consider a particular person who smokes, and then gets heart disease as a result. To say that his heart became diseased because of his smoking is to say that, somehow, the smoking was part of the chain of events that led up to, or brought about, the disease. Now, of course, the actual story of how the heart became diseased is very complex, involving innumerable factors. Many of these factors, moreover, are quite normal and healthy in themselves, such as the eating of food and the circulation of blood. The act of smoking, therefore, is merely a partial cause of the heart disease. Let us look at some possible meanings of "smoking causes heart disease", trying to reduce it to the meaning of "cause" in the singular, concrete case. Does it mean that every case of smoking causes a case of heart disease in that person? No, because some smokers never get heart disease, and so in such people smoking does not cause heart disease. Perhaps then we should weaken it to "for some smokers, smoking causes them to develop heart disease"? This is far too weak, however, as in this sense running causes heart disease, as does breathing, eating, blood circulation and so on.

25 See Frege (1891).
For many people, these are parts of the process which leads to heart disease. We want to be able to say something like: "smoking causes heart disease more often than does not smoking", or more precisely, "smoking has a higher chance of causing heart disease than does not smoking". For simplicity let us assume that all smoking humans share the same lagrangian, as far as the development of disease is concerned, as do all non-smokers. If S represents the property of being a smoker of a particular kind, so that P_S represents the chance function for a smoker, etc., and H the property of developing heart disease, then the claim that smoking causes heart disease is just that P_S(H) > P_¬S(H). We see therefore that, in the generic sense of causation, a cause is a chance-raiser.

1.5.2 Causes as Chance-Raisers

Let us consider a die that has been weighted on one face, so that the chance26 of a six is 2/3 rather than 1/6. In the generic sense of causation just discussed, we can say that the weighting of the die causes it to give the result 6. Let us now look at an individual roll, however, on which the die comes up 4. In this case, the weighting is part of the cause of its yielding a 4! The weighting is part of the dynamical nature of the die, which is a partial cause of the outcome. We see therefore that a cause in the proper sense cannot be defined as a chance-raiser, as a cause is sometimes a chance-lowerer. In a case where the weighted die yields a 4, the weighting is a partial cause of this, even though it greatly lowers the chance of a 4. It may be that the weighting reduces the chance of a 4 from 1/6 to 1/15, for example. If we look at a typical27 large class of trials, rolling weighted and non-weighted dice, however, then we do find in the class that the weighting, being a chance-raiser, is more frequently a cause of sixes than non-weighting is. A weighted die causes (in the proper sense) a six more frequently than does a non-weighted die, from which it follows that a cause (in the proper sense) usually raises the chance of its effects. A type of cause that lowers the chance of some effect will not cause that effect very often. Thus, the causes that actually succeed in producing their effect are usually ones that give the effect a relatively high chance.

26 Let us suppose that the die is genuinely stochastic.

27 A large class is typical if the relative frequencies of outcome-types are roughly equal to the chances. It will be shown in Chapter 3 that chance and frequency are linked in this way.

1.5.3 Signals Again

Let us consider again what is required for a signal to be transmitted from X to Y, by Morse code. In §1.2.1 we saw that, for this to occur, the chance of Y=1 must vary according to the state of X. This condition is met by the causal theory of chance just in case the state of X is at least a partial cause of the state of Y. For, if the state of X is among the causes of the state of Y, and the chance of an event is the degree to which it is determined by its causes, then the chance function for Y may vary as the state of X varies. On the other hand, if the state of X is not even a partial cause of the state of Y, then P(Y=1) is the same for both states of X. Thus, the fact that a causal link is necessary for signalling is a consequence of the causal theory of chance.

1.5.4 The Principal Problem Solved

I assume that all events are fully caused, so that CS holds. How can this position be maintained in the face of the Principal Problem described above? We should remember that causation is a relation between C and E, concrete particulars, whereas determination is a relation between m(C) and m(E), the maximal states of affairs concerning C and E. An important step in the two-ass argument was the supposition that Dolly and Polly are in the same initial state.
What I take this to mean is that, if C and C′ are the concrete initial time slices of Dolly and Polly respectively, then m(C) and m(C′) are congruent. Dolly and Polly initially have congruent maximal states of affairs. From this it is inferred that the system <Dolly, Polly> is initially exactly symmetric with respect to the two beasts. Now this inference is valid if m(C) and m(C′) are "complete", in some sense, for in that case any difference in C and C′ would show up as a difference between m(C) and m(C′). If m(C) and m(C′) are incomplete, however, then the inference is not valid. In this case there could be a difference between C and C′ even though m(C) and m(C′) are identical. Thus, a necessary hidden premise of the argument is that maximal states of affairs concerning events are always complete. With this premise the argument is valid, but otherwise it is not. I claim that the premise is false, so that the Principal Problem rests upon an unsound argument.

What is meant by "complete" here? The intuitive idea is that a complete state of affairs m(C) provides complete information about the physical event C - it leaves nothing out. Thus the completeness of m(C) would consist in its having some sort of equivalence, or correspondence, to C itself. This relation would connect two very different kinds of entity, however, one logical and the other physical, so I claim that it could not be anything like a perfect matching. To support my claim that no possible state of affairs is complete, I shall first consider and criticise an argument that maximality entails completeness. I shall then give two arguments in favour of my view.

One may think that a state of affairs m(a) is complete regarding an object a just in case m(a) contains a complete list of the properties of a - a list in which none of the properties of a is left out. If this understanding of completeness is adequate, then a maximal state of affairs has to be complete.
For, if m(a) were incomplete, then it would be missing some property F that a in fact possesses. In that case, the state of affairs (m(a) & Fa) would also be actual, and of course it entails m(a). It follows that m(a) is not maximal, so that incompleteness implies sub-maximality. Therefore, a maximal state of affairs must be complete. The flaw in this argument lies in its understanding of completeness. We must first remember that properties are just abstractions from states of affairs, as for instance the property F is abstracted as the "common component" of states of affairs like Fa, Fb, Fc, etc. What, therefore, is a complete list of properties? A list such as {F_1, F_2, F_3} is complete for an object a just in case there is no other property (F_4 say) that a possesses, but is not entailed by any of F_1, F_2, or F_3. But this is just to say that the state of affairs (F_1a & F_2a & F_3a) is maximal, as I define it! This argument proceeds by re-defining a maximal state of affairs as complete.

To clarify this point I should explain further the difference between the concepts of maximality and completeness. The notion of a maximum is derived from an ordering relation ≥ on a class Φ, say. A member x of Φ is a maximum of Φ with respect to ≥ just in case, for every y in Φ, x ≥ y. The notion of completeness, or perfection, is quite different from this. A perfect object is one that matches, or corresponds to, some external standard. Consider, for example, the idea of completing a one-mile race. The length of the race is marked out in advance, and an athlete has completed the course when she arrives at the finishing post. Suppose that, after running about half a mile, Janet is leading the race. Her position is then maximal, although it is not complete. Its maximality depends upon the comparison of her position with the positions of other athletes - the length of the course is irrelevant here.
The maximality of an object is an internal affair, being a matter of its relation to other objects of the same type. The incompleteness of her position, on the other hand, has nothing to do with the other athletes, but concerns her relation to the finishing post. The completeness, or perfection, of an object is an external affair, being a matter of conformity to an external standard. For another example, consider the problem of filling a bucket with water. The bucket is maximally full when the mass of water it contains cannot be increased, so that no fuller state is possible. It is completely, or perfectly, full when the level of the water coincides with the top of the bucket. Now, in this case the states of being maximally and completely full are extensionally the same, at least roughly, but notice how the concepts themselves are defined differently.

In this section we are concerned with the issue of maximality and completeness of information, so let us consider the example of maps. In the spirit of Definition we shall say that one map A entails another, B, just in case the pair of maps {A,B} gives the same knowledge as the single map A. If Φ is a class of (true) maps, then the map A is maximal in Φ just in case it entails all the other maps in Φ. What would it mean for a map to be complete? A map is a representation of a certain territory, so a complete (or perfect) map is one that corresponds to that territory exactly. A complete map would therefore be indistinguishable from the territory it represents. Such a map would have to be the same size as the territory, and composed of the same materials, so it would not really be a map at all, but a full-scale replica. We see that, since a map has to be made of paper, or some such two-dimensional material, it is fundamentally different from the territory it depicts. Thus even a maximal map must be incomplete.
So far I have argued that the supposition that maximality entails completeness rests upon a confusion about the concepts maximal and complete. Now I shall argue that there are strong reasons to hold that even a maximal state of affairs is incomplete. The first argument is direct: We need a strong distinction between the real and the merely possible in order to avoid modal realism. The second argument is rather indirect: The causal theory of chance, which solves a number of difficult problems, requires that maximal states of affairs be incomplete.

For the first argument, that accepting complete states of affairs would commit us to modal realism, it is convenient to think about the whole world, so that we are concerned with the relation between the real world W and the maximal actual state of affairs (or actual world) m(W). We have already seen that W and m(W) cannot be identified, since W is a concrete object whereas m(W) is a logical entity. In spite of this, one might still think that the difference between W and m(W) is rather trivial, so that they can be regarded as exact replicas of each other. If this were so, however, then W and m(W) would be similar in ontological status. It would then follow that the other, non-actual possible worlds like m(W), whose status is equal to that of m(W), would be just as real as W. This amounts to modal realism, which I take to be false. The alternative to modal realism is to say that the possible worlds (including the one that is actual) are abstract, or ersatz. I do not want to discuss the exact meaning of "abstract", but I take it that the ontological status of possible worlds is somehow drastically inferior to that of the real world. Should we be morally concerned about possible catastrophes that might have occurred but did not, in fact, happen? Is there anything that it is like to be a possible person? Do possible people suffer? I assume that the answer to each of these questions is No.
The second argument for the claim that even maximal states of affairs are incomplete is that this allows us to understand the nature of physical chance. If a perfect correspondence between a concrete object and an abstract state of affairs is impossible, even in principle, then even a total cause need not determine the existence of its effect. This permits us to define the chance of an event as the degree to which it is determined by a total cause. As will be shown in the remainder of the thesis, this causal theory of chance is far superior to all of its competitors. It is particularly successful in making sense of the behaviour of chances in quantum theory. Consider, for instance, the familiar claim that, if quantum mechanics is incomplete, then there are hidden variables. Bearing in mind the distinction between completeness and maximality, we see that this inference is invalid. If quantum mechanics were maximal as well as incomplete then there would be no hidden variables. I argue in Chapter 5 that quantum mechanics is both maximal and incomplete, as this view makes sense of the non-locality of chance in quantum mechanics.

We still have not provided an answer to the question of how W and m(W) are related. This is a difficult task, but the following picture may help. Leibniz imagined a God who surveyed the possible worlds and, finding the best, chose it as the real world. Creating the real world cannot be just a matter of taking one of the possible worlds "off the shelf" and re-naming it as the real world, however. In that case, the possible worlds would have the same fundamental nature as the real world, which amounts to modal realism. Rather, the actual world, once selected, would have to undergo some mysterious process of "concretisation", to turn it into the real world. The actual world m(W), on this view, is something like the formal cause of the real world W. The actual world helps to bring about the existence of the real world, but the two cannot be identified.
The actual world is merely the pattern, structure, or skeleton, of the real world.

1.5.5 Explanation

The concept of explanation is very familiar. Indeed, that is exactly where we began this chapter, with Aristotle's idea that a cause is just the thing invoked in answer to a "why" question. Why are there muddy footprints on the rug? Because the dog walked on the rug, after being outside. The dog's action, which is the cause of the footprints, seems also to be their explanation. Empiricists, such as Hume and his successors, cannot accept this account of explanation, as they hold that causal claims are meaningless. Lacking causal relations, they turn instead to logical relations, and offer us some version of the "Deductive-Nomological", or D-N, model of explanation.28 The basic idea is that, to explain a physical phenomenon, one must deduce it from laws.29 In my view, Aristotle and the empiricists are both half right. To explain a phenomenon is to cite its cause, and this frequently involves deducing (i.e. inferring) the phenomenon from a law. To see how this can be, let us recall the account of physical laws given in §1.3.2. The fundamental concept is that of the generalised lagrangian, l, of the system in which the phenomenon exists. The lagrangian represents the intrinsic dynamical nature of the system, and thus encodes such properties as geometrical configuration, masses, charges, elastic constants and so on. Now, from the above discussion of pendula30, it is clear that l describes one of the causes of the system's behaviour. Thus, if one infers a phenomenon from l, then one infers that phenomenon from one of its causes. Explanation can then be cashed out as follows: To explain a phenomenon is to infer the phenomenon from (a description of) its causes.
Thus, since a physical law describes the dynamical nature of a system, which is one of the causes of its behaviour, one kind of explanation is to infer the phenomenon from a law. There are other kinds of explanation, however, such as inferring a phenomenon from the boundary condition.

28 See for instance Hempel (1965) and Ernest Nagel (1961: 29-46).

29 The explanans may also include statements about the antecedent conditions.

30 There is, of course, nothing special about a pendulum in this regard. The same conclusion follows from considering any other type of system whose dynamics is understood.

1.5.6 Overview of the Thesis

Since the determination relation is logical entailment, relative to l, partial determination must be partial entailment relative to l. The causal theory thus requires there to be such a thing as partial entailment, or logical probability. The consensus in many quarters today is that there is no probability function which is determined by logic alone, so the first task is to show that this view is mistaken. In Chapter Two I attempt to demonstrate that logical probability is still very much a viable idea by giving an account of it. In Chapter Three the causal theory of chance is outlined in detail, and compared to its rivals. It will be shown that the causal theory entails Miller's Principle, that knowledge of the chance of an event E warrants a numerically equal degree of belief that E occurs. From this result it follows that, according to the causal theory, chances can be inferred (albeit approximately and fallibly) from empirical observations of relative frequency. This conforms perfectly to what we ordinarily suppose about chance and frequency. The remaining three chapters concern the application of the causal theory to some outstanding problems in theoretical physics. The purpose of these chapters is twofold. First, they provide support for the causal theory itself, since they demonstrate its fruitfulness in problem solving.
They give indirect empirical evidence that stochastic events are caused but not determined. Second, since the problems considered are themselves important, the solutions provided are of independent interest. In Chapter Four I set up a mechanical formalism for stochastic processes. The main purpose of this is to act as a foundation for the later chapters on correlation and measurement, but it is also used to tackle the problem of the direction of time, which I see primarily as one of accounting for temporally-asymmetric physical phenomena. In particular I focus upon Reichenbach's common cause principle as a general example of such a phenomenon. The formalism is based upon five assumptions about physical systems which seem reasonable, at least at large scales, and from these the required phenomena may be derived.

In Chapters Five and Six I analyse some problems in quantum mechanics from the point of view of the causal theory. This involves a detailed examination of the dual nature of chance. According to the causal theory, the chance function supervenes on the physical properties of the system concerned, and thus is itself a physical feature of the system. Since, however, it is defined using logical probability, it also turns out to be an epistemic probability function. These are the two faces of chance; it is entirely physical, yet it also represents rational degrees of belief within a particular possible epistemic state. A second feature of the causal theory, which also assumes central importance in understanding quantum chances, is that the knowledge "possessed" by the chance function is incomplete, even though it is maximal.

In Chapter Five I develop a theory of correlation which explains the counter-intuitive results obtained in EPR-type experiments. I show how these correlations, which cannot be used to transmit information, do not require any causal link between the separated systems.
In other words there is no need to drop Einstein's locality principle of special relativity, that causal influences cannot travel faster than light. We are forced to drop another locality principle, however, that a maximal state is always local, in the sense that the maximal state of a pair of systems can always be split into two separate states, one for each system. In Chapter Six the theory of quantum correlation developed in Chapter Five is applied to the more difficult problem of interpreting the state vector, and particularly the way it changes when a measurement occurs. The state vector is closely related to the chance function, and shares the latter's dual nature, being both physical and epistemic.

2. Logic and Probability

The aim of this chapter is to argue that there is such a thing as logical probability. The conception of logical probability I defend is somewhat different from previous accounts, however, so why do I call it "logical probability"? The essential notion of logical probability, as pioneered by Keynes (1921), is captured by the following three ideas:

(1) As the name suggests, logical probabilities are probabilities that are, in some sense, part of logic.

(2) A logical probability is a degree of entailment of one proposition by another. Thus a logical probability represents a logical relation between two propositions.

(3) A logical probability is some sort of degree of belief. The rough idea is that the logical probability of A given B, which we write as Pr(A | B), has something to do with the degree to which A should be believed, for someone who knows only that B holds.

Since my account of logical probability agrees with that of Keynes on these points, the use of the term 'logical probability' seems appropriate. These three strands are far from independent, of course, as (2) fleshes out what is meant by (1), and (3) does the same for (2).
Since entailment is a logical relation par excellence, it follows that degrees of entailment must be part of logic as well. What, however, is meant by a "degree of entailment"? We obviously require an account of partial entailment according to which ordinary entailment is a special case of partial entailment. The idea of a degree of entailment as a conditional degree of belief seems, at least, to be capable of providing such an account. If B entails A fully, then an ideal thinker who knows B with certainty will also believe A to the greatest possible degree. Thus we see that the notion of logical probability is most fully captured by (3).

I realise that some may find the following account of logical probability unconvincing, and so it may be wondered which features of logical probability are required for the rest of the thesis. Apart from the obvious requirements, such as satisfaction of the axioms of probability, these are as follows.

(i) Logical probability is not anthropocentric (so that chance is not anthropocentric). It is defined on objective states of affairs rather than human thoughts.

(ii) The Authority of Logic principle of §2.2.3 holds.

These two properties of logical probability are needed to ground the two main features of physical chance. (i) is needed for chance to be an objective, physical quantity, and thus not dependent upon human beings, and (ii) is required for Miller's principle, i.e. the Authority of Chance principle, that knowledge of the chance authorises a numerically equal degree of belief. If another account of logical probability were found that also satisfied (i) and (ii), then it would serve the thesis equally well.

2.1 The Objections to Logical Probability

Logical probability has been in the academic wilderness for some decades now, and many consider that to be its rightful place.
Before I begin to investigate it, and examine its properties, I therefore should explain why I think there is at least some hope of giving a satisfactory account of it, despite the failure of greater minds to do this. Perhaps the main reason why the prospects for logical probability seem so dim is that previous attempts to characterise it have failed, and in ways that seem quite fatal. The principal difficulties are:

(I) Measures that are arguably logical in character do not support learning from experience, and

(II) Logical probability seems to require an indifference principle, but these all lead to contradictions.

In addition to these practical problems that arise when trying to construct a theory of logical probability, there are also two more theoretical objections to the concept itself, namely:

(III) The logical probability function is too opinionated to be part of logic. The number Pr(A | T), where T is a tautology, is an exact degree of belief in the possible state of affairs A on the basis of no information at all! Logic, however, is independent of matters of fact, and so should have no such opinions.

(IV) The idea of logical probability violates the distinction between logic and psychology. A probability is a degree of belief, and belief belongs to psychology rather than logic. Logic is concerned with truth, not belief, and there are no degrees of truth.

These problems, as a whole, may look quite severe, but they will all be answered in this chapter. Objection (II) is met by formulating a symmetry axiom, proving it from the definition of logical probability, and showing that it is free of contradiction. The symmetry relation involved, which exists between some pairs of states of affairs, certainly appears to be logical in character. Objections (I) and (IV) are the most interesting, as they concern the relation of logic to human thought. Problem (I) assumes, for instance, that logical probability is involved in the rationality of scientific inferences.
This assumption was certainly made by the two main investigators of logical probability, J. M. Keynes (1921) and R. Carnap (1950). Keynes, for instance, summarises the notion of logical probability as follows:

Part of our knowledge we obtain direct; and part by argument. The Theory of Probability is concerned with that part which we obtain by argument, and it treats of the different degrees in which the results so obtained are conclusive or inconclusive. ... Given the body of direct knowledge which constitutes our ultimate premises, this theory tells us what further rational beliefs, certain or probable, can be derived by valid argument from our direct knowledge. This involves purely logical relations between the propositions which embody our direct knowledge and the propositions about which we seek indirect knowledge. (pp. 3-4)

Keynes believed, in other words, that the inductive inferences of natural science can be analysed using logical probabilities. He is quite explicit about this, as his main aim in the Treatise is to lay a firm, logical foundation for scientific reasoning. In science one often claims that a theory H is supported by experimental data E, that it is valid to accept the theory in the light of the empirical evidence. In Keynes' view this simply means E entails H to some degree, so that Pr(H | E) is reasonably high.1 This view of Keynes, that warranted scientific theories have high logical probabilities given the empirical data, is now known to be false. The basic problem is neatly illustrated by the example of predicting the colours of balls as they are drawn at random from an urn.2 Suppose we know that an urn contains N balls, some of which are black and the rest white, in unknown proportion. Some balls are drawn randomly from the urn, all of which are found to be black. What is the correct degree of belief that the next ball drawn will also be black, in the light of this evidence?
It should surely be greater than that of the first ball's being black, if we are to learn anything from experience. The possibility of such learning depends, however, on the prior probability function. It will be useful to compare two different prior probability functions that have been discussed extensively. Though they are quite different, they both display uniformity, being based on assignments of equal probability. The first, which is an instance of Carnap's measure c† (Carnap, 1950), assigns the same probability to each possible constitution of the urn. A possible constitution is a specification of the colour (white or black) of each of the N balls, so that there are 2^N possible constitutions. This measure is considered by some (such as Keynes) to be logical in character, but it does not permit any learning from experience. Since Bayesian conditioning preserves the relative probabilities of any two hypotheses consistent with the data, if we start with this measure then the probability of the next ball's being black remains always at 1/2, regardless of the data. I myself do not find this measure to be purely logical in any case, since there is no perfect symmetry between the different constitutions. At best, there is symmetry only between different constitutions of equal frequency (of black balls, say). The second probability function assigns the same probability to each possible frequency of black balls in the urn, so that each of the N+1 possible frequencies has probability 1/(N+1). Within a class of constitutions of the same frequency, each constitution has the same probability.

[Footnote 1: This claim, notoriously, leads to the problem of the priors. In order to calculate Pr(H | E) one has to use Bayes's theorem, for which one already needs the values of terms like Pr(E | H) and Pr(H). Unfortunately the terms like Pr(H), the infamous prior probabilities, do not seem to exist.]
[Footnote 2: This discussion is based on Howson and Urbach (1993: 59-72).]
Under this measure, which is an example of Carnap's measure c*, some constitutions are far more likely than others. In particular, constitutions whose frequency is close either to 0 or to N have high probability in comparison to those with roughly equal numbers of black and white balls. This prior expectation of uniformity in the colours of the balls leads one to infer that the next ball drawn will likely be the same colour as the majority of those drawn already. Indeed, it is simple to derive Laplace's rule of succession here, that if m of the first n observed balls are black, then the probability that the next ball is black is (m+1)/(n+2). This second measure, however, is certainly not logical in character. It cannot be justified in terms of symmetry, for example. This example, though simple, is a fine illustration of scientific reasoning. In order to make scientific inferences, one must make strong background assumptions about the nature of the world. In short, one assigns higher prior probabilities to hypotheses that seem plausible, and are in accordance with good sense. The question of whether these background assumptions, or prior probabilities, are valid is beyond the scope of this thesis. The point I wish to make is that, valid or not, they fall outside the domain of logic. The prior probabilities used in scientific reasoning are not logical probabilities, but rather epistemic probabilities, i.e. rational degrees of belief for humans. The reason why human good sense is not part of logic will be made clear in the next section, where the relation between logic and psychology is discussed in detail. Objection (III) is dealt with rather simply by pointing out that Pr(A | B) is not defined as a single number for all states of affairs A and B. In general, as discussed in §2.6, a logical probability is a sub-interval of [0,1]. This is why Keynes stresses the relational nature of logical probability (1921: 6-7).
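The contrast between the two measures can be checked by direct computation. The sketch below is my own illustration (not from Howson and Urbach); the urn size N = 10 and the draw counts are arbitrary assumed numbers. It conditions each prior over urn compositions on an observed sequence of four black draws, without replacement: the measure c† never moves from 1/2, while c* reproduces the rule of succession exactly.

```python
from fractions import Fraction
from math import comb

def falling(x, j):
    """Falling factorial x(x-1)...(x-j+1); equals 0 when j exceeds x >= 0."""
    out = 1
    for i in range(j):
        out *= x - i
    return out

def next_black(prior, N, n, m):
    """P(next draw is black | m blacks among the first n draws), drawing
    without replacement, where prior[k] is the (unnormalised) prior weight
    of the urn's containing exactly k black balls."""
    num = den = Fraction(0)
    for k in range(N + 1):
        # Likelihood of one particular ordered sequence with m blacks.
        like = Fraction(falling(k, m) * falling(N - k, n - m), falling(N, n))
        w = prior[k] * like
        den += w
        num += w * Fraction(k - m, N - n)  # chance next ball is black, given k
    return num / den

N, n, m = 10, 4, 4  # ten balls; four drawn, all black (illustrative numbers)

# c-dagger: equal weight for each of the 2^N constitutions, so the urn's
# having exactly k black balls gets weight C(N, k).
c_dagger = [Fraction(comb(N, k)) for k in range(N + 1)]

# c*: equal weight for each of the N+1 possible frequencies.
c_star = [Fraction(1, N + 1)] * (N + 1)

print(next_black(c_dagger, N, n, m))  # 1/2 -- no learning from the data
print(next_black(c_star, N, n, m))    # 5/6 = (m+1)/(n+2), rule of succession
```

Under c† the balls behave like independent fair coin flips, so the observed draws tell us nothing about the unseen balls; under c* the uniform prior over frequencies yields (m+1)/(n+2) for any N, n and m.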
In order for a precise value of Pr(A | B) to exist, B has to contain some information that is relevant to A, which means that B is not a tautology. In general, Pr(A | T) will be a wide interval, perhaps even [0,1] itself. Logical probability satisfies King Lear's dictum: "Nothing will come of nothing".

2.2 The Nature of Logical Probability

I hope that the discussion so far has begun to make the idea of partial entailment seem more plausible. Now I will try to explain more precisely the nature of this relation. To do this, it will be necessary to discuss some general views about logic. The essential point is that a logical probability is some sort of degree of belief, and it is currently held that the concept of belief has no place in logic. I shall argue, however, that belief (or, more precisely, epistemic state) is actually the central concept in logic.

2.2.1 Bedeutung and Sinn

Why is the notion of belief excluded from logic? It arises from the distinction, most forcefully defended by Frege (1884), between logic and psychology. The first of the three "fundamental principles" of the Grundlagen is "always to separate sharply the psychological from the logical, the subjective from the objective" (1884: X). Thus the concept of belief, which belongs to psychology, must sharply be distinguished from its logical counterpart, namely truth. Logic studies the laws of truth, whereas psychology studies belief and thought. Frege's insistence on the separation between logic and psychology was prompted by the emergence of psychologism in the study of logic, over the previous two centuries or so. The term 'psychologism' has been used in a variety of senses,[3] but the basic idea is that logical relations can be reduced to psychological relations. The subject matter of logic is the human mind, and the particular way it forms concepts, makes judgments, performs inferences, and so on.
Frege is implacably opposed to this invasion of psychological terms into logic and the foundations of mathematics, so that the first several pages of the Grundlagen are devoted to a forthright, sarcastic denunciation of it. He writes for instance:

A proposition may be thought, and again it may be true; let us never confuse these two things. We must remind ourselves, it seems, that a proposition no more ceases to be true when I cease to think of it than the sun ceases to exist when I shut my eyes. Otherwise, in proving the Pythagorean theorem we should be reduced to allowing for the phosphorus content of the human brain; and astronomers would hesitate to draw any conclusions about the distant past, for fear of being charged with anachronism,—with reckoning twice two as four regardless of the fact that our idea of number is a product of evolution and has a history behind it. It might be doubted whether by that time it had progressed so far. How could they profess to know that the proposition 2 x 2 = 4 already held good in that remote epoch? Might not the creatures then extant have held the proposition 2 x 2 = 5, from which the proposition 2 x 2 = 4 was only evolved later through a process of natural selection in the struggle for existence? Why, it might even be that 2 x 2 = 4 is itself destined in the same way to develop into 2 x 2 = 3! Est modus in rebus, sunt certi denique fines. (1884: V-VI)

The distinction between logic and psychology, i.e. truth and belief, is of the greatest importance, in my view, and an excellent place to begin an enquiry into the nature of logic. It is well known that a sentence[4] has two properties: it expresses a thought, and it has a truth value. A sentence is true just in case it corresponds to an actual state of affairs in the concrete world, and is false if it corresponds to some state of affairs that does not obtain.

[Footnote 3: Rolf George (1997) identifies four distinct senses.]
Thus every sentence, if it is meaningful, corresponds to some state of affairs, which may or may not be actual. Indeed, we may say that the state of affairs described by a sentence is the meaning of the sentence.[5] In human terms a sentence is used for communication, to express a belief, or rather the content of a belief. Following Frege, I shall call the content of a possible belief a Gedanke. Since two sentences may express the same belief, or have the same content, it is possible for two sentences to share the same Gedanke. If two Gedanken are equal, then it is not possible for a (properly-functioning) human to believe one and not the other, at a single time. Like Frege, I assume that two humans may think the same (type of) Gedanke, and that to understand a sentence is to grasp the Gedanke it expresses. One might be tempted to say that a Gedanke, or belief content, is the thing believed when one has that belief. This would be a grave mistake, however. To explain this point I shall adapt one of Frege's illustrations (1892: 60). Suppose one looks at the moon through a refracting telescope. The objective lens of this instrument forms a virtual image of the moon just in front of the eye lens. When one looks into the telescope, it is the moon that one observes, the actual celestial body. The image inside the telescope is, however, a necessary part of the physical process of observation; one might say that the moon is presented to the observer via this virtual image. In a similar way, consider a belief that the moon has no atmosphere. This belief concerns the external world, i.e. the moon itself, and not any psychological entities such as thoughts.

[Footnote 4: By 'sentence' I mean a declarative sentence, i.e. a sentence that has a truth value.]
[Footnote 5: Frege did not say this, but instead held that the meaning of a sentence is its truth value. I argue for my view over Frege's below.]
The thing that is believed here is the possible objective state of affairs that the moon has no atmosphere. A belief, however, is a psychological event that occurs within a human mind, and a Gedanke is one necessary component of any belief. It is analogous to the virtual image within the telescope, which is a necessary part of the physics of the process of observing with a telescope. A Gedanke is the manner in which a state of affairs is presented to a human mind. We see then that a sentence both expresses a thought, or Gedanke, and also corresponds to a state of affairs. The Gedanke is a psychological entity, a feature of the human mind, whereas a state of affairs is objective, or "inhuman", i.e. a feature of the external world. The logic/psychology distinction therefore matches the state of affairs/Gedanke distinction. States of affairs are part of the realm of truth, or objective reality, whereas Gedanken belong to the realm of human belief. Why is it necessary to make this distinction, however? Why not identify these two entities (as they perhaps look rather similar) and simplify the picture? (One might call the single entity a "proposition".) The necessity of this distinction is proved by Frege, however, in his monumental paper "Über Sinn und Bedeutung" (1892b), with an argument that I will now summarise. Frege begins by noting that equality, or identity, gives rise to questions that are not easy to answer. He observes that an identity statement 'a=b' must refer to the subject matter, i.e. the objects denoted by 'a' and 'b', rather than the written symbols, or it would not express any proper knowledge of the world. In that case, however, an identity statement seems to be rather trivial, as it merely says that an object is identical to itself. Yet some identities are far from trivial, or obvious, being important discoveries. For instance, Phosphorus (the morning star) is equal to Hesperus (the evening star).
This was an important astronomical discovery, requiring empirical observations, and cannot be known by internal reflection alone. How can this be? This identity is far from trivial, Frege argues, because the two names 'Hesperus' and 'Phosphorus' are associated with different manners of presentation of the single planet Venus. Venus presents itself to us humans under two different guises, sometimes as a morning star and sometimes as an evening star, so that it is not easy to tell that it is the same planet in each case. The manner of presentation associated with a linguistic unit, such as a proper name, Frege calls a Sinn, whereas the real meaning of that unit is its Bedeutung. Thus, in the Venus example, the names 'Hesperus' and 'Phosphorus' have the same Bedeutung (the planet Venus) but distinct Sinne. Note that, while a Bedeutung is an external, inhuman object, a Sinn is some sort of psychological entity, being bound up with the manner in which an object presents itself to humans. Now let us consider the two sentences:

(1) Hesperus is Hesperus
(2) Hesperus is Phosphorus

These two sentences have the same meaning, in that they correspond to the same state of affairs. (One often says that they are "true in the same class of possible worlds".) They do not express the same Gedanke, however, as someone might, without any mental dysfunction, believe (1) but not (2). At the level of Gedanken, (1) is a trivial tautology, whereas (2) is a significant statement of fact. This point is illustrated by considering the following two sentences:

(3) Abraham believes that Hesperus is Hesperus
(4) Abraham believes that Hesperus is Phosphorus

It may well be that (3) is true and (4) is false. It is clear that, for Abraham at least, the two thoughts are quite distinct, even though they correspond to the same external state of affairs.
If the two Gedanken were equal, then it would be possible to substitute one for the other, even in a belief context, without any change of truth value in the whole sentence. The distinction between Gedanke and state of affairs seems to be an instance of Frege's general distinction between Sinn and Bedeutung. The Bedeutung of a linguistic unit is its real meaning, its significance in the external world. The Sinn of a linguistic unit contains the manner in which the Bedeutung of the unit is presented to humans, under that unit. Thus the Bedeutung of 'Hesperus' is the planet itself, while the Sinn of 'Hesperus' is determined by the manner in which Venus presents itself to us under the name 'Hesperus', i.e. as a bright light visible above the western horizon after sunset. In a similar way, the Bedeutung of a sentence seems to be the state of affairs it represents, while the Sinn is the manner in which that state of affairs is presented to us humans, as a Gedanke. The sentences (1) and (2) present the same state of affairs under two different guises. This identification of the Bedeutung of a sentence with the state of affairs it represents is Fregean in spirit, but not in detail. In what I consider to be his greatest mistake, Frege argues that the Bedeutung of a sentence is its truth value. What is his argument for this? Frege's premise is that the Bedeutung of a sentence should not depend upon the Sinne of its sub-units (such as proper names) but only on their Bedeutungen. This premise is surely sound, as Bedeutungen are all external objects, having nothing to do with human beings. Then consider the sentences:

(5) Hesperus has no moon.
(6) Phosphorus has no moon.

Whatever the Bedeutung of (5) may be, it must be equal to that of (6), since (6) is obtained from (5) by substituting 'Phosphorus' for 'Hesperus', whose Bedeutungen are equal.
Now Frege notes that the truth values of (5) and (6) are equal, and indeed that truth values must always be preserved under such a substitution. He then asks, rhetorically, "what feature except the truth-value can be found that belongs to such sentences quite generally and remains unchanged by substitutions of the kind just mentioned?" (1892b: 64-65). For some reason that is unclear to me, Frege does not seem to notice that the state of affairs represented by a sentence is just such a feature. The sentences (5) and (6), for example, correspond to the same state of affairs, as they are true in exactly the same class of worlds. Frege's argument thus fails to rule out states of affairs as the Bedeutungen of sentences. Is there any reason to hold that the meaning of a sentence is a state of affairs, rather than a truth value, however? The following three arguments prove this. First, consider the following true statement:

(7) The 1998 World Cup was held in France.

Since (5) and (6) are also true it follows that, according to Frege, (5), (6) and (7) all have the same meaning, namely The True. It nonetheless seems that there is a similarity of meaning that exists between (5) and (6) that does not exist between (5) and (7). One would describe this similarity by saying that, while (5) and (6) correspond to the same state of affairs, (5) and (7) represent distinct states of affairs. Frege apparently has no way to mark this important distinction, however. He cannot appeal to Sinn here, as the Sinne of these three sentences are all distinct. A second argument proceeds from the premise that the sentence is the basic linguistic unit, and the fundamental carrier of meaning, since it is the smallest unit that "says something". In that case, the meanings of smaller units, such as proper names and predicates, should be definable in terms of the meaning of a sentence.
This is just not true if the meaning of a sentence is its truth value - the planet Venus does not seem to be a component of The True, for example. Venus does look as if it is a component of the state of affairs that Venus has no moon, on the other hand. It might even be possible to define Venus as the common component of all the states of affairs concerning Venus. This would explain why there is no possible state of affairs in which Venus is not identical to Venus - it is because Venus, as an object, is constituted by being a common component of distinct states of affairs. In a similar way, a Begriff (or concept, i.e. the meaning of a predicate) may perhaps be abstracted from an equivalence class of states of affairs. The third argument is that logic should only require Bedeutungen. The primary logical relations, namely consistency and entailment, are not relations between truth values. Does the True entail the True? Is the False consistent with the False? These questions are meaningless. These relations do hold between states of affairs, however. One state of affairs can entail, or include, another. For example, the state of affairs of Phosphorus's having no moon includes the state of Hesperus's having fewer than two moons. To make these relations of consistency and entailment part of logic, Frege is forced to regard the Sinne as part of logic rather than psychology. This is perhaps the reason why he is so adamant that Sinne are objective rather than subjective, and sharply to be distinguished from mental ideas. In view of Frege's account of a Sinn as a manner of presentation (to humans, presumably) this placement of Sinne on the logic side of the logic/psychology (truth/belief) division is rather implausible. In particular a Gedanke, or thought, is clearly a human thought, as who else's thought could it be? Thus Sinne are closely tied to humans, and hence to psychology.
2.2.2 Entailment and Probability

It is stated in the introduction that a logical probability is a degree of entailment, so we must try to understand the entailment relation. It is generally thought that entailment is a relation between propositions, but what are propositions? Sometimes "proposition" means a declarative sentence, but sentences themselves cannot directly entail other sentences. Rather, the proposition expressed by one sentence may entail that expressed by another sentence. In the previous section we saw the need to distinguish between two different kinds of sentence meaning, namely Gedanken and states of affairs. Perhaps either Gedanken or states of affairs are propositions? It seems to me that the term 'proposition' is sometimes used to mean a Gedanke, and sometimes a state of affairs, without a clear distinction being made between these two. Indeed, if we wish to use the term 'proposition', we should perhaps talk of "logical propositions" (objective states of affairs) and "psychological propositions" (human belief contents). There are plenty of examples of the term 'proposition' being used in each of these two ways. A proposition is often said to be defined by a class of possible worlds, namely the set of worlds in which it holds true. In this case, a proposition is indistinguishable from a state of affairs. At other times, however, a proposition is said to be the content of a (presumably human) belief.[6] If there are two kinds of proposition, the logical and the psychological, then we might expect there to be two matching kinds of entailment, which seems to be the case. Consider, for instance, the sentences (5) and (6) above. Does (5) entail (6)? At the level of Bedeutung they have the same meaning, so that each trivially entails the other. The state of affairs of Phosphorus's having no moon is necessitated by Hesperus's having no moon. At the level of Sinn, however, it seems that neither entails the other.
It would certainly be invalid for a human to infer (6) from (5) since, as far as one knows, (5) could well be true and (6) false. We should not imagine, of course, that these two levels of entailment are quite disconnected. After all, a Gedanke is just a manner in which a human mind grasps a state of affairs. Entailment at the level of Gedanken is a rather messy affair, as one has to take account of human limitations, including the following. First, a single state of affairs may have different probabilities (degrees of rational belief) under different Gedanken, within a single epistemic state. Second, there is no guarantee that Gedanken defined by logical operations will exist. There may be two Gedanken A and B that are consistent, and yet no conjunction A&B exists simply because it is too complicated. The human mind is finite, and cannot grasp states of affairs of arbitrary complexity, so that the class of Gedanken will not form even a Boolean algebra. Third, the same finite nature of the mind surely precludes deductive closure of epistemic states, even as an ideal. Entailment at the level of Bedeutung, on the other hand, is much cleaner and "logical", being free of human limitations. The class of states of affairs does (it is assumed) form a Boolean algebra, for example, as for any two consistent states of affairs there exists also a conjunction of them. If logical probability is to be used in an analysis of physical chance, it must be non-anthropocentric, and thus a generalisation of this entailment relation at the level of Bedeutung rather than Sinn. We must therefore find an account of objective entailment, which will henceforth be called just "entailment". As far as I am aware, there have been only two attempts to define entailment.

[Footnote 6: Remarkably, these two explications of what a proposition is sometimes even appear together! See, for instance, Plantinga (1974: n. 1, p. 45).]
The first is inspired by Tarski's definition of formal entailment (Tarski, 1935), as a relation between sentences of a formal language.[7] One sentence Φ formally entails another, Ψ, if every interpretation that satisfies Φ also satisfies Ψ, roughly speaking. This idea is adapted to the problem of defining real entailment as follows: One state of affairs A entails (or includes) B just in case every possible world that satisfies A also satisfies B. The second attempt to define entailment is a more radical idea, due to Peter Gardenfors (1988: 135). He introduces the notion of an epistemic state into logic, and defines propositions as functions between epistemic states. The proposition A then entails B just in case the composite function A∘B equals the function A. We shall see that Gardenfors' idea is by far the more fertile of the two. Let us look first at the Tarski-style definition. The two key terms here are 'possible world' and 'satisfaction', so we must first be clear on what they mean. A possible world is a possible state of affairs, a way things might have been, but not every state of affairs is a possible world. A possible world has the special feature of being maximal, in some sense. The obvious difficulty here is that to be maximal is, apparently, not to be entailed by (or included in) any other state of affairs. The definition of a possible world thus requires a prior understanding of entailment between states of affairs. This is bad enough, but matters get even worse when we consider the satisfaction relation. What is it for a possible world to satisfy a state of affairs? A world w satisfies a state of affairs A just in case A necessarily obtains if w does. In other words, w satisfies A just in case w entails, or includes, A! We see that this approach to defining entailment is hopelessly circular.

[Footnote 7: Though the subject of formal entailment is interesting, it does not concern us here.]
Gardenfors' foundation for logic is based upon the concept of an epistemic state. In Fregean spirit, he emphasises that this is not a psychological notion, but is rather epistemological. It is a state of rational belief for an idealised epistemic agent. (An important part of the idealisation is that every epistemic state is self-consistent and deductively closed.[8]) In his work of analysing epistemic expansions and contractions, Gardenfors looks at idealised human epistemic states, however, which are not suitable for our purpose. As stressed above, we are trying to understand entailment at the level of Bedeutung, as a relation between objective states of affairs. An examination of human epistemic states, however idealised, can only shed light upon entailment at the human level of Sinn, as a relation between Gedanken. We will therefore consider the epistemic states of a perfect, infinite intellect, a mind of unlimited capacity that infallibly draws all and only valid inferences. I assume that the Gedanken of this being are (or are indistinguishable from) states of affairs themselves, so that the Sinn/Bedeutung problem does not arise. For this reason I will treat states of affairs as components of epistemic states, and definable from epistemic states, rather than as separate entities.[9] The analysis of states of affairs and the entailment relation roughly sketched below is my own, although it borrows heavily from Gardenfors (1988: 132-145). Gardenfors is interested in changes of epistemic state with time, which fall into two basic kinds, called expansions and contractions. (A third kind of change, called a revision, may be understood as a contraction followed by an expansion.) An expansion occurs when one acquires extra knowledge, or learns new information. A contraction occurs when one discovers a mistake of some kind, and so gives up previously-held beliefs. It is important to understand that when new information acts as a defeater for some previously-held beliefs, reducing their probabilities, this is an expansion rather than a contraction. A contraction occurs only when one decides that the old beliefs were mistaken even on the information one had at the time. Both expansions and contractions are possible for human beings, but contractions can never occur for a perfect intellect, so we only need consider expansions. The epistemic states can be seen as the points of a kind of logical space. States of affairs are present in this space, but not as independent entities. Rather, they are vectors that take you from one epistemic state to another. The rough idea is that a state of affairs is a carrier of information, and so the vector for a state of affairs A represents the change of knowledge due to learning that A is actual. In this picture, the fundamental logical relations are those between epistemic states, with relations between states of affairs defined in terms of them. The chief logical relation between epistemic states, which is the root of the entailment relation, I call superiority. The rough idea of superiority is that a state K is superior to K', which we write K > K', just in case K knows at least as much as K', i.e. nothing is known in K' that is not also known in K. In Gardenfors' terminology, K > K' just in case K is an expansion of K'. Alternatively we may say that K > K' if it is possible to move from K' to K. Since knowledge cannot be lost, only gained, it is only possible to move from one state to a superior one. Note that each state is superior to itself, as a "move" from K' to K' is trivially possible.

[Footnote 8: Clearly, the statement that all epistemic states are deductively closed is part of the informal explication of epistemic states, and not a rigorous definition. The use of epistemic states to define entailment would otherwise be circular.]
[Footnote 9: States of affairs are dependent on these ideal epistemic states in the same way that Gedanken are caused by human beliefs.]
If K is superior to K', then we shall say that K' is inferior to K. To define states of affairs in terms of epistemic states, we first need the notion of a maximal epistemic state. A maximal epistemic state is a state Kw for which there does not exist any K' (other than Kw itself) such that K' > Kw. What, intuitively, does a maximal state look like? It should be a state of certainty about every possible state of affairs A, i.e. one believes either A or its negation with certainty. Such an epistemic state is a "learning terminus": after one gets there, one can move nowhere else. It is a state that cannot be expanded further. Kw is therefore the epistemic state where it is believed with certainty that some possible world, say w, obtains. It is shown below that possible worlds can be defined from maximal epistemic states. A general state of affairs A will be abstracted from the epistemic state KA, where KA is the minimal state in which A is fully believed. How can we define KA, without invoking A? We first define a disjunction of epistemic states, as follows.

Definition The disjunction of states {K1, K2, ..., Kn} is the maximal state that is inferior to each of K1, K2, ..., Kn. The disjunction of K1 with K2 is written K1 ∨ K2.

This matches the usual truth-functional definition of disjunction, as the disjunction of two sentences is their strongest common consequence. (Note how there is a disjunction relation between epistemic states, that is ultimately used to define disjunction between states of affairs, and also that disjunction is defined in terms of superiority.) We then define pure epistemic states:

Definition
(i) Every maximal epistemic state is pure.
(ii) The disjunction of a set of pure epistemic states is also pure.
(iii) No state is pure unless (i) and (ii) together entail that it is.

What, intuitively, is a pure epistemic state? It is generally recognised that a state of affairs is uniquely determined by the set of all possible worlds that include it.
This is basically the idea that the meaning of a sentence is determined by its truth conditions, i.e. the conditions under which the sentence is true. Now consider a pure state K that is the disjunction of maximal states K1, K2, ..., Kn, corresponding to belief in worlds w1, w2, ..., wn, and let the state of affairs A be the disjunction of those worlds. Since each of w1, w2, ..., wn entails A, it follows that A is believed with certainty in each of the states K1, K2, ..., Kn, and hence A is believed with certainty in K. Now, is there any other state K' < K, such that A is also believed with certainty in K'? Such a K' would also be inferior to each of K1, K2, ..., Kn, since inferiority is transitive. Within the state K' one is certain that one of the worlds w1, w2, ..., wn obtains, as these are the only worlds consistent with A. It follows from this, however, that K' > K, since in K one merely knows (for certain) that one of the worlds w1, w2, ..., wn obtains. Thus K' > K, and K' < K, so K' = K, and we see that K is the minimal state in which A is believed with certainty. We can write this state as KA. Each pure state can be used to define a state of affairs, as is shown below. Before we can define states of affairs, we need some preliminary definitions.

Definition A class of epistemic states is consistent iff there is some epistemic state that is superior to all its members.

Definition The conjunction of {K1, K2, ..., Kn} is the minimal state that is superior to each of K1, K2, ..., Kn. The conjunction of K1 with K2 is written K1 & K2.

Clearly, if the class {K1, K2, ..., Kn} is not consistent, then these states do not have a conjunction.[10] This is in contrast to conjunction as an operation on sentences, where the conjunction always exists. The essential difference is that, while there are sentences (Φ & ¬Φ, for example) that are inconsistent, there are no inconsistent epistemic states. Each epistemic state could be the epistemic state of an ideal intellect.
In Gardenfors' account a state of affairs is a function, although he does not specify which one exactly. Since the function represents the expansion upon learning that A obtains, it makes sense for the domain of A to be the class of epistemic states consistent with KA, and for A to map each such state K to K & KA. Under this definition, A entails B, i.e. KA > KB, just in case A∘B = A, so his proposed account of entailment works. We shall have occasional use for these functions, which I call Gardenfors functions, but a better way to define states of affairs is as follows.

[Footnote 10: We could introduce a (fictitious) absurd epistemic state, to ensure that conjunctions always exist, so that the pure epistemic states form a Boolean algebra.]

Consider some epistemic state K that is consistent with KA. In general, when the Gardenfors function A is applied to K, resulting in K & KA, the actual expansion (or learning) involved is rather less than A, since K might already have some of the information carried by A. In the extreme case where K > KA, for example, no learning takes place at all, as A(K) = K. If we want to talk about the actual learning that takes place on a particular expansion, therefore, the Gardenfors function does not help. One expansion where the full learning of A takes place is the move to KA from an initial state of no information at all. We define this state as follows:

Definition The epistemic state K0 is the disjunction of every epistemic state.

It is clear that K0 is the minimal state, being inferior to every epistemic state. I assume that it is also equal to the disjunction of all the maximal states, and thus a pure state. In that case, it defines a state of affairs O, whose truth set is the class of all possible worlds. O, in other words, is the unique necessary state of affairs. Any sentence that is necessarily true, such as "Hesperus is Phosphorus", has O as its Bedeutung. Again, we see a contrast between symbolic logic and real logic.
In symbolic logic there are many tautologies, whereas in real logic there is only one "tautology", O. We now define states of affairs.

Definition The state of affairs A is the expansion from K0 to KA.

Corollary The possible world w is the expansion from K0 to Kw.

We can now define entailment, and the usual logical connectives, between states of affairs.

Definition A entails B, i.e. A => B, just in case KA > KB.

Since states like KA & KB are easily shown to be pure, we can define the usual logical connectives in the obvious way, viz.:

Definition KA&B = KA & KB, KAvB = KA v KB, etc.

Definition Let t(K), the truth set for K, denote the class of possible worlds consistent with K. (Note that K may be any epistemic state, and does not have to be pure.)

Theorem If K' > K, then t(K') ⊆ t(K).

Proof: Suppose that K' > K, but that for some world w ∈ t(K'), w ∉ t(K). Then K knows that w is not the case, whereas K' does not know this. Thus, contrary to the hypothesis, K' is not superior to K.■

Corollary If A is believed with certainty in K, and K' > K, then A is believed with certainty in K'.

Proof: Since A is certain in K, it follows that A is entailed by each member of t(K). Then, since K' > K, it follows that t(K') ⊆ t(K), and so A is entailed by each member of t(K'). Since A obtains in every world consistent with K', A must be certain in K'.■

Theorem A entails B iff B is believed with certainty in KA.

Proof: (i) If A entails B, then KA > KB. B is believed with certainty in KB, and hence in KA as well, since it is superior to KB. (ii) If B is believed with certainty in KA, then KA > KB.■

We see that the definition of A => B, that KA > KB, is equivalent to the statement that B is certain in KA. Now, at the human level, belief is a matter of degree, so that there are grades of certainty with regard to a state of affairs, ranging from firm belief down to firm disbelief.
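The theorem, corollary, and entailment result above can be checked in a small truth-set model. This is my sketch, with subset standing in for superiority; the names and worlds are invented for illustration.

```python
# States of affairs and epistemic states both pictured by sets of worlds;
# K' > K corresponds to t(K') being a subset of t(K). (Illustration only.)

def certain(tA, tK):
    """A is believed with certainty in K iff A holds at every world of t(K)."""
    return tK <= tA

tA = {"w1", "w2"}          # truth set of A; K_A has this same truth set
tB = {"w1", "w2", "w3"}    # B is weaker, so A entails B
tK_sup = {"w1"}            # a state superior to K_A (smaller truth set)

# Theorem: a superior state has a smaller (or equal) truth set.
assert tK_sup <= tA
# Corollary: certainty about A is preserved in any superior state.
assert certain(tA, tA) and certain(tA, tK_sup)
# Theorem: A entails B iff B is believed with certainty in K_A.
assert (tA <= tB) == certain(tB, tA)
assert not certain({"w3"}, tA)    # A does not entail that w3 obtains
```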
A human epistemic state is not dogmatic with regard to every state of affairs; on some matters one is sceptical, tentative or unsure. It is plausible to suppose, therefore, that the same is true of ideal epistemic states. Consider, for instance, the state of affairs A, that Venus has either one moon or none, and B, that Venus has no moon. Within KA, is B believed or not? It would be irrational for B to be believed with certainty, and also for ¬B to be believed with certainty, as neither of these is entailed by A. The correct attitude within KA is surely some partial belief in B. Unless epistemic states can include such partial beliefs, it is impossible to regard them as a rational ideal. If an epistemic state can include partial beliefs, however, then there is a natural way to generalise the entailment relation to include degrees of entailment.

Definition A entails B to degree p iff B is believed to degree p in KA, i.e. Pr(B | A) is the degree to which B is believed in KA.

2.2.3 Truth and Authority

One of the maximal epistemic states, KT say, has a rather special relationship with the real, concrete world. It does not make sense to say that KT is true, since it is itself the standard for truth and falsehood. The actual world, T, is merely the expansion from K0 to KT, and the real world is an embodiment, or concrete version, of T. KT is the formal cause of the real world, and the defining characteristic of true sentences and beliefs, so we may call it the truth. It might seem a little extreme to define the truth as an epistemic state11, but it is the only way I can see to understand the relation between belief and truth. The truth/belief distinction is just the Bedeutung/Sinn distinction. (It is interesting that postmodern thinkers, who reject the very notion of truth, use "truth" and "belief" as synonyms, talking about "your truth" and so on.)

Definitions (i) K is veridical iff it is consistent with KT.
(ii) A state of affairs A is actual just in case KA is veridical. (iii) A true sentence is one whose Bedeutung is an actual state of affairs.

Suppose one knows for sure that the state of affairs A obtains. Moreover, A in fact entails some other state of affairs B. One may well not believe that B obtains, as one fails to see that B follows from A. Perhaps one still cannot see the entailment even when it is pointed out and explained. If, however, one somehow learns that A does include B, then one can infer B from these two premises: A, and A => B. (This is quite a different inference from deducing B directly from A as a single premise.) In the two-premise inference, one is effectively submitting to the authority of logic, accepting its verdict, whereas in the one-premise inference one does the thinking for oneself. Both inferences are valid, of course. Logical relations of all kinds seem to be authoritative in this way with regard to belief. If one learns that two states of affairs are inconsistent, for example, then one should not believe firmly in both of them. It is reasonable to suppose, therefore, that logical probabilities are also authoritative.

Definition If K > K', and K is veridical, then K is an authority for K'.

[Footnote 11: I actually got the idea from Plato, however, so its pedigree is not too bad. See the Republic (Cornford, 1941, VI 507-508).]

This is a rather stringent definition of relative authority; I imagine that some weaker condition would be sufficient, but this definition will serve our purpose.

Principle (Authority of Logic) If one knows that K is an authority for one's epistemic state, and knows that A is believed to degree p in K, then one is (defeasibly) authorised to believe A to degree p.

This is the foundation for Miller's Principle, which I call the Authority of Chance principle. The authorisation is clearly defeasible, as there may be more than one such authority.
The only non-defeasible authorisation occurs when p = 1, so that A is known with certainty.

2.2.4 Relative States of Affairs

We defined a state of affairs A as the expansion from K0 to KA. Of course, K0 is not the only epistemic state that may be expanded upon learning that a state of affairs obtains. For instance, there is the smaller expansion from KU to KA&U.

Definition The relative state of affairs A/U, read "A given U", is the expansion from KU to KA&U.

Two distinct states of affairs, A and B, may be such that, for some third state of affairs U, A/U = B/U, i.e. A(KU) = B(KU), where A and B are the Gardenfors functions. For example, if U is Smith's being an Albertan, A is Smith's being a farmer, and B is Smith's being a Canadian farmer, then A/U and B/U are the same relative state of affairs. The conjunction operation, for absolute states of affairs, is defined above. That definition cannot be used for relative states of affairs, since there is no epistemic state KA/U, in general. The expansion A/U cannot be applied to the state K0, I think, because A/U presupposes U - it simply makes no sense from the point of view of K0. For example, the relative state of affairs that the Loch Ness Monster weighs 300 tons only makes sense relative to the state of affairs that there is a monster in Loch Ness. In the case of absolute states of affairs, the definition of conjunction given above is equivalent to the following:

Definition KA&B = A(KB) = B(KA), where A and B are Gardenfors functions.

Now, this idea can be applied to relative states of affairs, in cases where the expansion A/U can be applied to a state KC. Such cases occur when the presuppositions made by A/U are all contained in KC. For instance, if A is logically independent of U, then A/U presupposes nothing, and can be applied to any epistemic state. Also, A/U can be applied to KU and any state superior to KU.

Definition KC&(A/U) is the state obtained from applying A/U to KC.

Theorems (i) B & A/B = A&B.
(ii) If A => B, then A = B & A/B.

Proof: Since A/B is defined as the expansion from KB to KA&B, part (i) is trivial. Part (ii) follows from (i), since if A => B then A&B = A.■

Relative states of affairs are required later in this chapter, in the definition of symmetry between states of affairs, and again in Chapter 6, where they are essential to the interpretation of the quantum-mechanical state vector.

2.3 Measuring Degrees of Belief

Having sketched out the nature of logical probability, let us now proceed to giving a precise account of it. The rough definition we have so far is: Pr(A | B) is the degree to which A is believed in KB. So the first task is to devise a system for measuring degrees of belief. One standard way to define degrees of belief is using gambles, but I prefer to use contracts instead, as they are more flexible. A contract is something like this: I'll give you g if it rains tomorrow. The general form is [g if A], where g is some desired thing and A is a state of affairs. If you possess that contract, then you acquire g if A is true, and nothing (from that contract) otherwise. It is easy to define a gamble on A, with betting quotient p and stake $1, as the purchase of [$1 if A] at the price $p. If the object g is desirable, then clearly the contract [g if A] has some value too, as whoever possesses it might get g from it. Moreover, the worth of the contract depends on the degree to which one believes A - indeed, the value of the contract increases strictly with the degree of belief in A. Thus we can use the value of the contract to measure the degree of belief, i.e. to give "degree of belief" a precise meaning. We shall therefore be defining the probability of A given B as the value of a contract. It may seem odd to consider desires and preferences as part of logic, but these need not be human preferences. Indeed, it surely makes sense to use the desires and preferences of the perfect, infinite mind already invoked in §2.2.
If this being has beliefs, then why not desires as well? Moreover, we shall see that to define values for arbitrarily-great goods, which are needed to define real-number values, one must consider the preferences of a being of infinite capacity. We require a function, which I call val, that maps each contract to a real number that measures its value, within an epistemic state. What meaning can this number have? In general, one measures12 objects by reference to a standard object, called a unit. Mass, for example, is measured by reference to a standard massive object, as the number of instances of the unit required to equal the body in question with respect to mass. In a similar way, the rough idea of val(g) is the number of units required to equal g with respect to preference. Let us begin by describing the different preference relations. The fundamental preference relation between desirable objects a and b is that a is preferred to b, which is written a ≥ b.

2.3.1 Definition a ≥ b just in case there is no objection to giving up b in return for a.

Note that the relation is reflexive, i.e. a is preferred to a. This relation of preference can be used to define three further preference relations, as follows.

2.3.2 Definition (i) a > b, i.e. a is strictly preferred to b, iff a ≥ b but not b ≥ a. (ii) a ~ b, i.e. a and b are preference equivalent, iff a ≥ b and b ≥ a. (iii) a <> b, i.e. a and b are preference incommensurable, iff neither is preferred to the other.

Let us denote the unit desired thing by '$'. (We will call it "dollar", but its true identity is not relevant.) In order for it to be suitable as a unit, it must be reproducible, so that there can be arbitrarily many other objects $1, $2, ..., all similar to $, so that $i ~ $j. One may possess several of these dollars, in which case we shall say that one owns the bundle {$1, $2, ...}.
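The derived relations of Def. 2.3.2 can be tabulated in a small sketch. The objects and the base relation here are invented for illustration; only the derivation pattern comes from the definition.

```python
# A toy base relation "x is preferred to y" given as an explicit table,
# from which the strict, equivalence, and incommensurability relations
# of Def. 2.3.2 are derived. (Objects a, b, c are hypothetical.)

prefs = {("a", "a"), ("b", "b"), ("c", "c"),  # preference is reflexive
         ("a", "b"), ("b", "a"),              # a and b are equivalent
         ("a", "c")}                          # a strictly preferred to c

def pref(x, y):       return (x, y) in prefs
def strict(x, y):     return pref(x, y) and not pref(y, x)
def equivalent(x, y): return pref(x, y) and pref(y, x)
def incommens(x, y):  return not pref(x, y) and not pref(y, x)

assert all(pref(x, x) for x in "abc")   # reflexivity, as noted above
assert equivalent("a", "b")
assert strict("a", "c")
assert incommens("b", "c")              # neither preferred to the other
```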
A bundle is a simple collection, or aggregate, rather than a set, so that a singleton bundle is equal to its only member, {$1, {$2, $3}} = {$1, $2, $3}, and so on.

[Footnote 12: One must be aware of the two meanings of 'measure'. First there is a specification of what it means for an object to be assigned a number. Second, there is the physical process by which a number is assigned to a particular object. I am using 'measure' in the first sense.]

If the dollar is to work as a unit of value, it is clearly necessary that equally-sized bundles of dollars be preference equivalent, e.g. {$1, $2} ~ {$3, $4}, as otherwise a cardinal number of dollars would not specify a unique value. In that case one could not say that an object is worth two dollars, without saying which two dollars! We thus postulate that, for any bundle of dollars a, {$i, a} ~ {$j, a}. Assuming the dollars are all intrinsically similar, this is sure to hold. We will henceforth denote a bundle of n dollars by '$n', as the identity of each dollar is irrelevant. Now let us consider contracts. It makes sense to use contracts for dollars, such as [$r if A], as the object $r always has a precise value, namely r. Indeed, from now on the term 'contract' will always mean a contract of the form [$r if A]. In order for the value function to be linear, in the sense that the value of a bundle (of dollars and contracts) equals the sum of the values of its parts, we need to assume a principle known as independence, which is as follows.

2.3.3 Principle (Independence) If [$r if A] ~ $q, then {[$r if A], a} ~ {$q, a}, where a is any bundle of dollars and contracts.

This principle will not hold in general, as the following two examples show. First consider a coffee shop that gives away one $ with each cup of coffee bought there. When ten of these have been collected, they may be exchanged for a cup of coffee. Suppose further that each $ has little or no value in itself, but is only useful as a means for acquiring coffee.
In this case, the principle of independence will fail. For suppose that the contract [$6 if A] is worth $3, for someone who has no other dollars. Then, if one already has $4, the contract [$6 if A] will be worth more than $3, as $10 is worth a cup of coffee whereas $7 is practically worthless. Second, suppose that [$1000 if A] is worth $1. It does not follow that 1000 such contracts together (which is equivalent to the single contract [$1,000,000 if A]) is worth $1,000, since it may be that $1,000,000 is more dollars than a person could possibly use in their whole lifetime. In that case one would be foolish to give up $1000 in return for the contract, as $1,000,000 may be worth only about the same as (say) $5000. To avoid the first kind of example, we shall suppose that dollars are desired for their own sake, and for no other reason. Their value is purely intrinsic, so that (intuitively speaking) the value of a bundle of dollars is just the sum of the values of the parts. The second kind of example is actually impossible since we are considering not human desires, which may be saturated, but those of a being of infinite capacity. If [$r if A] ~ $n, then val[$r if A] = n. What about contracts that are not preference equivalent to a bundle of dollars, however? Fortunately, this value function is easily enriched to include quotients as well. If two bs together are worth $1, for example, then each b has value 1/2, and three bs have joint value 3/2, etc. Finally, we can extend the value scale to the positive reals, using the method of Dedekind cuts. For a contract a may be such that, for the value scale defined by $, for every possible quotient value r, either r < a or r > a.13 The object a then defines a Dedekind section on the $-value quotient scale, and this cut may itself be regarded as a further possible value on the scale, represented by the obvious irrational number.14 The precise definition of val is as follows, where a is a bundle of dollars and contracts.
Note that {} is an empty bundle - literally nothing at all.

[Footnote 13: This use of '<' does not fall within the scope of our previous definition, as r is a quotient rather than a desirable object. One may imagine, however, that the symbol r here denotes an arbitrary object of value r.]

[Footnote 14: All values on this scheme are non-negative real numbers. The question of whether there are negative values does not concern us here.]

2.3.4 Definition The value of a, or val(a), is defined for the unit $ as follows. (I) If a ~ {}, then val(a) = 0. (II) If {a1, ..., aq} ~ $p, then val(a) = p/q. (III) For each object a, let Q_* = {r ∈ Q : (∃x)(val(x) = r & x < a)} and Q^* = {r ∈ Q : (∃x)(val(x) = r & x > a)}. Assuming each quotient is the value of some object, <Q_*, Q^*> is a Dedekind cut. We then define val(a) as the real number represented by <Q_*, Q^*>.

Once again, it is quite clear that we are not concerned with human beings here. It is not plausible that desirabilities for humans are as finely graduated as this. Indeed, why suppose that it even makes sense to talk of such precise values? There is no a priori guarantee that it does make sense15, but let us see what follows from it. It should be noted that, according to Def. 2.3.4, real-number values are not precise, in the sense that val(a) may equal val(b) even if a and b are not preference equivalent. For instance, any object whose value is "infinitesimal" will have the real number zero as its value, even though it is strictly preferred to the empty bundle {}.16 Although the equality of val(a) and val(b) does not entail that a ~ b, unless val(a) and val(b) are quotients, the converse does hold.

2.3.5 Theorem If val(a) exists, and a ~ b, then val(a) = val(b).

Proof: Suppose val(a) exists and that a ~ b; then val(a) is either a rational or an irrational number. Case (i): suppose that val(a) is rational. Then, for some natural numbers p and q, we have that {a1, ..., aq} ~ $p.
Now, applying independence q times, together with the assumption that a ~ b, it follows that {b1, ..., bq} ~ $p, and so val(b) = p/q = val(a). Case (ii): val(a) is irrational. It is clear that a and b will have the same preference relations to all objects of rational value, i.e. if a > x then b > x and so on, and so a and b define the same Dedekind cut of the rationals. Thus val(a) = val(b).■

[Footnote 15: Many logical truths cannot be known a priori.]

[Footnote 16: This point was brought to my attention by Paul Bartha.]

2.4 The Axioms of Probability

We are now almost ready to demonstrate the four axioms of probability17, but first we shall need some definitions and preliminary results. The objects a, b, etc. are bundles of dollars and contracts. It should be noted that the proofs given below are not "Dutch book" arguments, involving the dubious assumption that a set of betting quotients is irrational if it leaves one vulnerable to a Dutch book18, but are direct demonstrations.

[Footnote 17: The first three of these axioms were formulated by Kolmogorov (1933).]

[Footnote 18: The assumption of irrationality becomes dubious in the case of bets made at different times. It means that changing one's mind, without additional information, is always irrational. In a sense this is true, but frequently such changes are corrections of earlier mistakes. Here, it is the original belief, rather than the change, which is irrational, yet it is the change (not the original belief) which exposes one to a Dutch book.]

2.4.1 Definition Pr(A | B) = val[$1 if A] within KB.

2.4.2 Lemma val{a, b} = val(a) + val(b), in any state K.

Proof: Case (i): val(a) and val(b) are natural numbers, say p and q respectively. We then have, by the definition of val, that a ~ {$1, $2, ..., $p} and b ~ {$1, $2, ..., $q}. By independence it follows that {a, b} ~ {$1, $2, ..., $p+q}, and thus val{a, b} = p + q, as required. Case (ii): val(a) and val(b) are rational numbers, say p/q and r/s respectively. Then, again by the definition of val, we have {a1, a2, ..., aq} ~ $p and {b1, b2, ..., bs} ~ $r. Then, using Lemma 2.4.2 in Case (i), we have {a1, a2, ..., asq} ~ $sp and {b1, b2, ..., bsq} ~ $qr. Again using Lemma 2.4.2 in Case (i) it follows that {{a1, b1}, ..., {asq, bsq}} ~ $(sp + qr). Thus, by definition, val{a, b} = (sp + qr)/sq = p/q + r/s, as required. Case (iii): val(a) and val(b) are real numbers r and s, represented by Dedekind cuts <R_*, R^*> and <S_*, S^*> respectively. Let t = r + s, and let t be represented by the cut <T_*, T^*>. Then we are required to prove that val{a, b} is represented by <T_*, T^*>, i.e. that T_* is the set of quotient values of objects inferior to {a, b}, and T^* is the set of quotient values of objects preferable to {a, b}. Consider an arbitrary t- ∈ T_*. It is always possible (by the definition of real addition) to find quotients r-, s- in R_*, S_* such that r- + s- = t-; then select objects a-, b- whose (quotient) values are r- and s-. Then, by Lemma 2.4.2 Case (ii), val{a-, b-} = r- + s-, and it is also clear that {a-, b-} < {a, b}. Thus each element of T_* corresponds to an object inferior to {a, b}, and it may similarly be shown that each member of T^* corresponds to an object preferable to {a, b}.■

2.4.3 Lemma Pr(A | B) = p iff ([$1 if A] ~ $p) within KB.

Proof: Immediate from Definition 2.4.1.■

2.4.4 Lemma [$p if O] ~ $p, in any state K.

Proof: K is superior to K0, so O is believed with certainty in K.■

2.4.5 Lemma val[$rp if A] = r.val[$p if A], in any state K.

Proof: Case (i): r is a natural number. Since [$rp if A] ~ {[$p if A]1, ..., [$p if A]r}, we can apply Lemma 2.4.2 to get val[$rp if A] = val[$p if A]1 + ... + val[$p if A]r.■ Cases (ii), where r is a quotient, and (iii), where r is real, may be proved using arguments similar to those given in the proof of Lemma 2.4.2.

2.4.6 Lemma If val[$p if A] = val[$q if B], then val[$rp if A] = val[$rq if B].

Proof:
Immediate from Lemma 2.4.5.■

The axioms of the probability calculus may be shown as follows. The "within KB" part is omitted unless it is manipulated in the proof.

Axiom 1 Pr(A | B) ≥ 0.

Proof: The contract [$1 if A] either yields $1 or nothing. It has to be worth at least zero, therefore.■

Axiom 2 Pr(O | B) = 1.

Proof: [$1 if O] ~ $1, from Lemma 2.4.4, and val($1) = 1.■

Axiom 3 Pr(AvB | C) = Pr(A | C) + Pr(B | C), where A and B are inconsistent.

Proof: val[$1 if AvB] = val{[$1 if A], [$1 if B]} = val[$1 if A] + val[$1 if B], using Lemma 2.4.2.■

Axiom 4 If Pr(B | C) exists and is non-zero, then Pr(A&B | C) = Pr(A | B&C)Pr(B | C).

Proof:
1. ([$1 if A] ~ [$Pr(A | B) if O]) within KB. Consequence of Lemma 2.4.3.
2. ([$1 if A] ~ [$Pr(A | B&C) if O]) within KB&C. Consequence of Lemma 2.4.3.
3. ([$1 if A&B] ~ [$Pr(A | B&C) if B]) within KB&C. From 2. Since B is known to be true, it can be added to both sides.
4. ([$1 if A&B] ~ [$Pr(A | B&C) if B]) within K¬B&C. Both contracts are worthless.
5. ([$1 if A&B] ~ [$Pr(A | B&C) if B]) within KC. From 3 and 4. One of KB&C, K¬B&C is an authority for KC.
6. ([$1 if B] ~ [$Pr(B | C) if O]) within KC. Consequence of Lemma 2.4.3.
7. ([$Pr(A | B&C) if B] ~ [$Pr(A | B&C)Pr(B | C) if O]) given C. From 6, multiplying both sides by Pr(A | B&C), using Lemma 2.4.6.
8. ([$1 if A&B] ~ [$Pr(A | B&C)Pr(B | C) if O]) given C. From 5, 7, using transitivity.
9. Pr(A&B | C) = Pr(A | B&C)Pr(B | C). From 8, by Lemma 2.4.3, as required.

This fourth axiom of the probability calculus is also known as the Principle of Conditioning. Note that it is not a diachronic constraint on one's beliefs, for time has nothing to do with it; rather, it describes the partial entailment relation between states of affairs. It is often written in the equivalent form (suppressing C for simplicity): Pr(A | B) = Pr(A&B)/Pr(B), and referred to as the definition of the symbol Pr(A | B), or as the definition of "conditional probability". This is a mistake, for two reasons.
First, all probabilities exist only relative to an epistemic state, so Pr(B) means Pr(B | O), which does not exist in general. Second, the symbol Pr(A | B) already has a meaning, as the value of [$1 if A] within KB. That is why we read it as "the probability of A given B". One cannot give two separate definitions for the same symbol, unless their equivalence is demonstrated.

2.5 Relative Probabilities

It is common to talk of relative probabilities, such as in the statement "A is three times as likely as B". What do these assertions mean, however? Perhaps the most obvious answer is that "A is three times as likely as B" means that Pr(A)/Pr(B) = 3, or

Pr(A) = 3.Pr(B) (1).

Using Definition 2.4.1, (1) is equivalent to val[$1 if A] = 3.val[$1 if B], or

val[$1 if A] = val[$3 if B] (2).

Statement (2) entails, but is not equivalent to,

[$1 if A] ~ [$3 if B] (3).

Although (1) entails (3), statement (3) does not entail (1) since val[$1 if A] and val[$3 if B] might not exist. Thus (3), being weaker than (1), requires a new notation, and will be written R(A,B) = 3. In general we have the following definition.

2.5.1 Definition For propositions A, B we define R(A,B), the probability of A relative to B, as follows: R(A,B) = p just in case [$1 if A] ~ [$p if B].

The following theorems are obvious.

2.5.2 Theorem Pr(A) = R(A,O).

2.5.3 Theorem If Pr(B) exists, then R(A,B) = Pr(A)/Pr(B).

A more interesting theorem is a generalised version of the principle of conditioning, which is proved in the same way as Axiom 4.

2.5.4 Theorem Pr(A | B) = R(A&B,B).

The advantage of Theorem 2.5.4 over Axiom 4 is that there is no need for Pr(B) to exist. This is of particular importance to cases where one wants to condition on very unlikely events, whose probability might be described as "infinitesimal". Consider, for example, an infinite lottery, where a ticket is selected at random from an urn that contains a countably infinite number of such tickets.
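Before turning to that example, the axioms of §2.4 and the theorems about R can be checked on a finite model. This is my construction, using a uniform measure over worlds rather than the thesis's contract-based derivation; the particular sets of worlds are arbitrary.

```python
# Conditional probability on truth sets, uniform over a finite set of
# worlds; exact arithmetic via Fraction. (Model and numbers are mine.)
from fractions import Fraction

W = set(range(12))                       # O, the necessary state of affairs

def pr(A, B=None):
    """Pr(A | B), defined whenever B contains at least one world."""
    B = W if B is None else B
    return Fraction(len(A & B), len(B))

def R(A, B):
    """Relative probability of A to B: here, a ratio of world-counts."""
    return Fraction(len(A), len(B))

A, B, C, D = {0, 1, 2, 3}, {2, 3, 4, 5, 6}, {0, 2, 3, 5, 7, 8, 9}, {7, 8}

assert pr(A) >= 0                                  # Axiom 1
assert pr(W, B) == 1                               # Axiom 2: Pr(O | B) = 1
assert pr(A | D) == pr(A) + pr(D)                  # Axiom 3 (A, D disjoint)
assert pr(A & B, C) == pr(A, B & C) * pr(B, C)     # Axiom 4
assert R(A, W) == pr(A)                            # Theorem 2.5.2
assert R(A, B) == pr(A) / pr(B)                    # Theorem 2.5.3
assert pr(A, B) == R(A & B, B)                     # Theorem 2.5.4
```

In the infinite lottery below, pr would be undefined for single tickets, yet the analogue of R would remain well defined, which is the point of the relative notation.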
Does it make sense to assert that every ticket has the same probability of being selected? Though this seems like a meaningful scenario, one sometimes hears that it is impossible, on the grounds that it cannot be represented by a probability function. If each ticket has probability ε, then ε has to be either zero or a finite, positive number. Both of these are ruled out by the addition axiom, however. The problem here is that, if each ticket has the same probability, then that probability has to be infinitesimal, so to speak. If the proposition A says that some ticket a is picked, then [$1 if A] > $0, since A might be true, but [$1 if A] < $ε, for each ε > 0. Using relative probabilities, we can express the fact that A and B are equally likely by the statement R(A,B) = 1. Thus, if b is some ticket distinct from a, and B says that b is picked, then it is meaningful to assert that [$1 if A] ~ [$1 if B], i.e. R(A,B) = 1. It is clear that the "absolute" probability function Pr can be reduced to the relative probability function R, since Pr(A | B) = R(A,O), within the state KB. Is a reduction possible in the reverse direction, however? Can relative probabilities be reduced to relations between absolute probabilities? The answer to this is clearly No, if we are restricted to real-number probabilities, as the case of the infinite lottery shows. This limitation of the real number system is surely related to the point made in connection with Definition 2.3.4, that real numbers are only approximate representations of measure. Two objects of slightly different value can generate the same Dedekind cut on the quotients. Unfortunately there does not seem to exist any system of measures that is more accurate than the real numbers.

2.6 Interval Probabilities

In §2.4 we assumed that the contract [$1 if A] had an exact dollar value within KB. This is not the case for every pair of propositions A and B, however, as we shall see below.
More generally there exists some interval of values, each of which is incommensurable with [$1 if A]. Suppose that, for some good $p, we have that [$1 if A] <> $p. This entails that there does not exist any good $a such that [$1 if A] ~ $a, so that val[$1 if A] does not exist, as the following theorem shows.

2.6.1 Theorem If [$1 if A] <> $p, then there is no good $a such that [$1 if A] ~ $a.

Proof: Suppose that there is some a ∈ [0,1] such that [$1 if A] ~ $a. Then, since p is also a real number, we have that either p < a, p = a or p > a. These cases entail, however, that [$1 if A] > $p, [$1 if A] ~ $p, and [$1 if A] < $p respectively, contrary to the hypothesis.■

2.6.2 Definition If val[$1 if A] does not exist, within KB, then let Pr(A | B) equal the set I, where I = {p : [$1 if A] <> $p (within KB)}.

2.6.3 Theorem I is an interval (if we include ∅ as an interval).

Proof: Any set D such that, if a, c ∈ D and a < b < c, then b ∈ D, is an interval. (Thus the null set and singletons are intervals.) For the sake of reductio, therefore, let a, c ∈ I but suppose that there exists some b ∉ I in between them, i.e. a < b < c. Since b ∉ I, it follows that either [$1 if A] < $b, [$1 if A] ~ $b, or [$1 if A] > $b. In each of these cases we can refute at least one of the statements a ∈ I, c ∈ I. Thus, by reductio, each b between a and c is a member of I, and so I is an interval.■

Thus, if we define Pr(A | B) as in Def. 2.6.2, a probability is a sub-interval of [0,1]. One kind of example where Pr(A | B) is plausibly an interval is the following. Suppose B says that, in a particular urn, the proportion of black balls is somewhere in the interval [a, b], and that a ball is drawn from the urn at random. A says that the ball drawn is black. Let us assume for now that, if the proportion of black balls were known to be p, then the probability would also be p. It then follows that Pr(A | B) = [a, b] in this example.
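The interval structure of Theorem 2.6.3 can be sketched numerically. The bounds and the grid of dollar amounts below are my invention, chosen to match an urn example with proportion interval [0.3, 0.7]: an imprecise contract is modeled by a pair of bounds, with $p inferior below the lower bound, superior above the upper bound, and incommensurable in between.

```python
# An imprecise contract [$1 if A] modeled by value bounds lo, hi: the
# incommensurable dollar amounts form exactly the interval [lo, hi].
# (Bounds are hypothetical illustrations.)

lo, hi = 0.3, 0.7

def compare(p):
    if p < lo:
        return "contract preferred"      # [$1 if A] > $p
    if p > hi:
        return "$p preferred"            # [$1 if A] < $p
    return "incommensurable"             # p lies in the interval I

# Sweep a grid of dollar amounts and collect the incommensurable ones.
I = [p / 100 for p in range(101) if compare(p / 100) == "incommensurable"]

assert I[0] == 0.3 and I[-1] == 0.7
# Interval property: a point between two members is itself a member.
assert compare((I[0] + I[-1]) / 2) == "incommensurable"
```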
It should be noted that, although the axioms of probability are stated in terms of single-value probabilities, they also apply to interval probabilities. The trick here is to represent the interval Pr(A | B), = [a1, a2] say, as a parameter x that ranges over [a1, a2], i.e. as an arbitrary function x whose range is [a1, a2]. The advantage of this parametric representation is that the parameters obey the axioms of probability. If Pr(A | B) = x, for example, then Pr(¬A | B) = 1-x. Thus we immediately infer that Pr(¬A | B) = [1-a2, 1-a1], i.e. the range of the parameter 1-x. Later in this thesis, in chapters 4 and 5, some problems involving correlation will be examined. A crucial concept, therefore, is that of probabilistic independence, which we will now define for the logical probability function.

2.6.4 Definition A and B are logically independent just in case Pr(A | B) = Pr(A).

Independence implies consistency, of course, or else Pr(A | B) = 0. O is independent of everything, even O. The intuitive idea of independence is that A and B have no overlap of content, i.e. they do not "intersect". The two propositions "Smith is an Albertan farmer" and "Smith is a dairy farmer", for instance, both say that Smith is a farmer - they overlap, in other words. For this reason the propositions are not independent; if we learn that Smith is an Albertan farmer, this increases the probability that Smith is a dairy farmer, because we can now be sure that he is at least a farmer.

2.7 The Symmetry Axiom

As I have presented them, the axioms of probability are not part of the definition of the logical probability function, for that is defined in Def. 2.4.1. Moreover, within the theory of logical probability they do not function as axioms, but as theorems, since they are proved from the definition of probability.
The purpose of proving the axioms is not to define probability, nor to argue that the axioms are true, but rather to confirm Definition 2.4.1 by showing that the axioms all follow from it. An analysis of logical probability that did not allow one to prove the axioms would be unacceptable.

One common element in previous accounts of logical probability is the use of a symmetry, or indifference, principle. It is widely felt that, if there are logical probabilities, then they must be assigned largely (though by no means wholly) on the basis of symmetries. I also hold this conviction, and so in this section I shall develop and prove a fifth axiom, the symmetry axiom.

The notion of symmetry is well known in geometry, so we shall begin there. In geometry, however, symmetry is usually considered to be an absolute property of a single object. It is absolute in the sense that we say figure A has reflective symmetry tout court, not that it has reflective symmetry w.r.t. figure B. It is a property of a single object, rather than a relation between two objects, in the sense that we say A has reflective symmetry, rather than saying that the pair {A, B} has reflective symmetry. We shall see however that the ordinary notion of symmetry is easily extended to become relative and relational in the required way.

Let us consider a geometrical plane, containing some geometrical figures. The notion of symmetry in geometry is defined using transformations, known as isometric transformations, or isometries, which preserve the distances between all pairs of points. Such functions include translations, rotations and reflections, and combinations of these. For a particular isometric transformation f (such as rotation through angle π/3 about point (2,5), or reflection in the line y = 2x), a figure A has f-symmetry just in case fA = A. That is to say, if we consider A to be the set of points of which it is constituted, then the set of images under f of points in A is just the set A itself.
To be more precise, we shall represent a plane "universe" U as a "colouring function" u, which maps ℝ² to {0,1}. The idea is that u(x,y) represents the colour of the point (x,y), where 0 means the background colour (white, say) and 1 means the ink colour, perhaps black. In this representation a figure is a set of points that are coloured black. A transformation on U is a function f which maps ℝ² to ℝ², as usual. What colouring function u′ represents fU, i.e. the image of the universe U under f? If u(x,y) = 1, then we want that black point to be moved to f(x,y), so that u′(f(x,y)) = 1. White points need to be moved in the same way. We therefore have that u′(f(x,y)) = u(x,y), which entails that u′ = u∘f⁻¹. Thus, if u represents U, then fU is represented by u∘f⁻¹.

We are now ready to extend this notion of symmetry to make it a relation between a pair of figures within a "universe". Consider for instance a figure A which is not f-symmetric, and suppose that the function f is its own inverse, i.e. f = f⁻¹, or ffA = A.¹⁹ Then the pair of figures, A ∪ fA, which may be considered a single, non-connected figure,²⁰ has f-symmetry in the above sense. This is clear since f(A ∪ fA) = fA ∪ ffA = fA ∪ A, as required. In general we can say that A and B are f-symmetric, i.e. A bears the relation of f-symmetry to B, just in case fA = B, and the combined figure A ∪ B has f-symmetry, that is f(A ∪ B) = A ∪ B.

Suppose that A and B are f-symmetric, and consider a plane which contains only the figures A and B. This plane will be called the universe U. It is clear that, within U, there is no way to pick out either A or B using only distances. If someone asks which of the figures is A, and which is B, one cannot say "A is the one that ...", since there is no property F, which depends only upon distances between pairs of points in U, such that A has F but B does not. This may be clearer if we draw the universe U, as in Figure 2.7.1 below.
Figure 2.7.1 [figure not reproduced] (Note that the border is not part of U, as U consists only of A and B.)

We cannot single out A from among the pair {A, B} by means of the shapes of A and B themselves, as fA = B and all distances are invariant under f (which is a reflection in this diagram). Moreover, we cannot pick A out from {A, B} by means of its relationship to the rest of U either, using only distances, since f(A ∪ B) = A ∪ B. For consider some relation R, definable in terms of distances, which A bears to the rest of U. (We may think of this as A's view of U.) Since U as a whole is invariant under f, and f maps A to B, B's view of U is distance-indistinguishable from A's view. Thus B must bear R to the rest of U as well, which means that R cannot be used to single out A.

¹⁹ Since A is a set of points, fA is the set of images of these points under f, i.e. fA = {(x,y) : f⁻¹(x,y) ∈ A}.
²⁰ Recall that a figure such as A is a class of points that are coloured black, so that A ∪ B is just the set of points that are either in A or in B or both.

This matter of not being able to "single out" either of A or B may become clearer if we consider another universe where A and B are not symmetric, such as the one in Figure 2.7.2.

Figure 2.7.2 [figure not reproduced]

Here there is an isometric transformation f such that fA = B, namely a horizontal translation, but we do not have fU = U, since f maps B even further to the right rather than onto A. It is also clear that, while A and B are internally the same shape, their views of U differ. A sees another hand with the thumb pointing away from him, whereas B sees another hand with the thumb pointing towards him. The figure A can thus be picked out within U as the hand whose thumb is pointing at another hand in U.
This feature of a symmetric pair of objects, A and B, that neither can be uniquely singled out from among the pair {A, B}, is not a mere curiosity but the very property of symmetry which (in the context of propositions, rather than geometrical figures) gives rise to the symmetry axiom. Indeed, we may consider the fundamental definition of symmetry (at least the kind of symmetry required for the symmetry axiom) to be as follows: A and B are symmetric in U just in case neither can be singled out from among {A, B} within U.

It is crucial to recognise the difference between symmetry and similarity. Symmetrical objects may be dissimilar. Consider, for instance, two right-handed gloves of the same type. The relation between these objects is similarity, or qualitative identity. There is no contrast between them. Now think of two gloves of the same type, but where one is left-handed and the other right-handed. They are not similar, as there is a contrast between them. This contrast emerges clearly if we imagine that someone is given a pile of left-handed and right-handed gloves, all mixed up. He will have no difficulty in separating them into two distinct piles, one of left-handed gloves and the other of right-handed ones, but he cannot tell which pile is which unless he has some previously-identified right-handed (or left-handed) object to compare them to. If two objects are similar then they are also symmetric, but the converse does not hold, as the glove example shows. Two objects that are not similar may still be symmetric, if the contrast between them is of a kind that does not allow either one to be uniquely singled out. Of course, the possibility or otherwise of using a dissimilarity between A and B to pick out (say) B depends on the resources available. Our definition of geometrical symmetry is then as follows:

2.7.3 Definition A is symmetric to B, which we abbreviate to A Y B, within U if and only if, for some isometric transformation f, fU = U and fA = B.
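Definition 2.7.3 can be illustrated with a finite analogue (my own sketch, not from the thesis): take U to be a small point set, a handful of candidate isometries, and test whether some isometry fixes U while mapping A onto B.

```python
# Finite sketch of Definition 2.7.3: A Y B within U iff some isometry f
# satisfies fU = U and fA = B. The figures and the candidate isometries
# below are illustrative choices of mine.

def apply(f, figure):
    return frozenset(f(p) for p in figure)

identity = lambda p: p
reflect_y = lambda p: (-p[0], p[1])     # reflection in the y-axis
half_turn = lambda p: (-p[0], -p[1])    # rotation by pi about the origin
isometries = [identity, reflect_y, half_turn]

A = frozenset({(1, 0), (2, 1)})
B = frozenset({(-1, 0), (-2, 1)})       # the mirror image of A
U = A | B                               # the universe contains only A and B

def symmetric(X, Y, universe):
    """A Y B within U: some isometry fixes U and carries X onto Y."""
    return any(apply(f, universe) == universe and apply(f, X) == Y
               for f in isometries)

print(symmetric(A, B, U))   # True: reflection in the y-axis works
```

Note that `symmetric(A, A, U)` also holds via the identity map, matching the reflexivity claim of Theorem 2.7.5 below.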
2.7.4 Theorem Symmetry is a symmetric relation, i.e. A Y B ⇒ B Y A (all within U).

Proof. Suppose that A Y B, within U. Then, for some isometry f, fU = U and fA = B. Now, since f is isometric, so is f⁻¹. Then f⁻¹U = f⁻¹fU = U, and f⁻¹B = f⁻¹fA = A. Thus f⁻¹ is the required isometry to make it the case that B Y A within U. ■

2.7.5 Theorem Symmetry is also reflexive and transitive, and thus an equivalence relation.

Proof. A Y A within U, since the identity transformation is an isometry and maps U to U and A to A. To show that Y is transitive, suppose that A Y B and B Y C, by virtue of isometries f and g respectively. Then the composition gf is also an isometry which maps U to U, and gfA = gB = C. Thus A Y C, as required. ■

2.7.6 Theorem If A Y B within U, then it is impossible to single out A from among {A, B} using properties which supervene on distances within U.

Proof. Suppose that A Y B within U, by virtue of isometry f. It is then clearly impossible to find a difference between A and B by comparing distances within A to those within B, as the fact that fA = B entails that A and B are isomorphic with respect to distance. Moreover, A's relation to U is indistinguishable from that of B, since U is itself invariant under f. ■

So far we have only considered A-B symmetry within a universe U consisting only of A and B themselves. This is an unnecessary restriction, as Definition 2.7.3 and the resulting theorems can be applied to any geometrical universe at all. Consider, for instance, the universe in Figure 2.7.7.

Figure 2.7.7 [figure not reproduced: a universe containing four figures A, D, B, C]

The universe U now contains four figures, A, B, C and D, and we see that A and B are symmetric in U. This is because there is an isometry f, namely a rotation about the obvious point, such that fA = B and fU = U. It is also clear that it is impossible to single out A from among {A, B} within U using only distances, as we should expect from Theorem 2.7.6.
A and B are not the only symmetric pair within U, as A and C are also symmetric, for example. In fact, every pair of figures in U is symmetric in U.

I have so far discussed the notion of relational symmetry in the context of geometry, as it is a simple way to convey the basic idea. We are really interested however in symmetry between states of affairs. Is it possible to apply the definition we have of "A and B are symmetric within U" to states of affairs A, B and U? What would this intuitively mean? Let us consider the pure epistemic state K_U. We shall now consider two states of affairs, A and B say, within K_U, i.e. the relative states of affairs A/U and B/U. The intuitive idea is that the relation of A and B being symmetric within the epistemic state K_U is analogous to the geometrical case defined above. K_U does not contain sufficient resources to single out A/U from among the pair {A/U, B/U}. We therefore have the following definition:

2.7.8 Definition A is symmetric to B, which we abbreviate as A Y B, within K_U just in case A/U cannot, within K_U, be singled out from among {A/U, B/U}.

To get a firmer grip on this definition, let us think about similarity and symmetry between states of affairs, with the help of an example. Consider a pair of dice, a and b, that are intrinsically similar, having the same size, shape, weight, colour and so on. As any high-school student of probability knows, there are two ways to obtain a total score of 11 with two such dice: you can have a 6 on a and a 5 on b, or a 6 on b and a 5 on a. Even though these two states of affairs "look the same", as the dice are similar, they still count as distinct possibilities as the dice are distinct entities. The two states of affairs are not similar, as there is a contrast between them, but they are symmetric. The perfect similarity of the dice prevents either from being uniquely singled out from among {a, b}, and so neither of the states of affairs can be singled out either.
I can see no difference, with regard to states of affairs, between similarity and identity. Similarity between states of affairs seems to be sufficient for identity. If there is no contrast between states of affairs A and B, then they are the same state of affairs.

Now we have a definition of symmetry, we are ready to state the Symmetry Axiom. It is as follows.

Axiom 5 If A Y B within K_U, then [$1 if A] ~ [$1 if B] within K_U.²¹

Proof. By symmetry(!) it is sufficient to prove that [$1 if A] ≥ [$1 if B], i.e. that there is no objection within K_U to giving up [$1 if B] in return for [$1 if A]. Now, assuming A Y B within K_U, neither contract can be singled out within K_U. Thus, one cannot single out the case of possessing [$1 if A], as compared with the case of possessing [$1 if B]. (One can see that these cases are distinct, but one cannot pick out either case.) This being so, a rational person does not object to an exchange of [$1 if B] for [$1 if A], as required. ■

²¹ I am tempted to miss out the number 5 and call this Axiom 6, as 5 is an unlucky number for axioms. Consider the trouble with Frege's Rule V, Euclid's fifth postulate and Peano's fifth axiom.

2.7.10 Theorem If A Y B within K_U, and val[$1 if A] (within K_U) exists, then Pr(A | U) = Pr(B | U).

Proof. By Axiom 5, if A Y B within K_U then [$1 if A] ~ [$1 if B], in K_U. If val[$1 if A] (in K_U) also exists, then by Theorem 2.5.5 it follows that val[$1 if A] = val[$1 if B], in K_U, i.e. Pr(A | U) = Pr(B | U). ■

2.7.11 Theorem If A₁/U, A₂/U, …, Aₙ/U are pairwise symmetric within K_U, and also pairwise mutually exclusive and jointly exhaustive within K_U, then Pr(Aᵢ | U) = Pr(Aⱼ | U) = 1/n.

Proof. Since A₁/U, A₂/U, …, Aₙ/U are mutually exclusive and jointly exhaustive within K_U, we have that {[$1 if A₁], [$1 if A₂], …, [$1 if Aₙ]} ~ $1. Also, since they are pairwise symmetric, Axiom 5 entails that [$1 if A₁] ~ [$1 if A₂] ~ … ~ [$1 if Aₙ]. Then, using independence, it follows that {[$1 if Aᵢ]₁, [$1 if Aᵢ]₂, …, [$1 if Aᵢ]ₙ} ~ $1, i.e.
n separate contracts [$1 if Aᵢ] are together worth $1 as well. By Def. 2.3.4 we thus get val[$1 if Aᵢ] = 1/n, and then by Theorem 2.3.5 we deduce that val[$1 if Aᵢ] = val[$1 if Aⱼ] = 1/n. The result is immediate. ■

This latest theorem bears more than a passing resemblance to Keynes's Principle of Indifference, and its predecessors. This should be a cause for concern since, while many issues in the foundations of probability are disputed, there is an impressive consensus that all such principles are invalid. Van Fraassen states, for instance, "... I regard it as clearly settled now that probability is not uniquely assignable on the basis of a Principle of Indifference, or any other logical grounds".²² Howson and Urbach concur: "... 'logical' probability measures, whether based on the Principle of Indifference or on some other method of distributing probabilities a priori, do not, we believe, possess a genuinely logical status. For such systems are ultimately quite arbitrary, and we take logic to be essentially noncommittal on substantive matters."²³

Why are these authors so sure that any equi-probability principle must fail? There are two main reasons. First, there are the so-called paradoxes which these principles generate, whereby different ways of applying them to the same problem produce inconsistent results. Second, as mentioned in §2.1, there is the fact that purportedly a priori measures, such as Carnap's c* measure,²⁴ which are intended to provide a purely logical basis for inductive inferences, clearly contain synthetic assumptions about the world.

These reasons do not seem to count against Theorem 2.7.11 in the least, however. It is true that this theorem is of no use at all in giving a theory of inductive inference but, as explained in §2.1, inductive inferences are not purely logical and so this is just as it should be.
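As a small worked illustration of Theorem 2.7.11 (mine, not the thesis's): with the two symmetric dice discussed earlier, the 36 ordered outcomes are pairwise symmetric, mutually exclusive and jointly exhaustive, so each gets probability 1/36; the two distinct states of affairs (6,5) and (5,6) then make a total of 11 twice as likely as a total of 12.

```python
# The 1/n equiprobability of Theorem 2.7.11 applied to two symmetric
# dice: 36 pairwise-symmetric ordered outcomes, probability 1/36 each.

from fractions import Fraction

outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
p = Fraction(1, len(outcomes))           # 1/n, with n = 36

def prob_total(t):
    """Probability of total score t, summed over the symmetric outcomes."""
    return sum(p for (a, b) in outcomes if a + b == t)

print(prob_total(11))   # 1/18: the two distinct cases (6,5) and (5,6)
print(prob_total(12))   # 1/36: only (6,6)
```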
Moreover, as is demonstrated below by examining some examples, this definition of symmetry cannot be applied to a single problem in different ways, yielding inconsistent conclusions.

Let us look at some previous formulations of the Principle of Indifference, and the problems they face. According to Howson and Urbach, a typical formulation of the Principle from Bernoulli to Keynes is as follows:²⁵

If there are n mutually exclusive possibilities A₁, …, Aₙ, and U gives no more reason to believe any one of these more likely to be true than any other, then Pr(Aᵢ | U) is the same for all i.

²² van Fraassen (1989:292).
²³ Howson and Urbach (1993:71-72).
²⁴ Carnap (1950).
²⁵ Howson and Urbach (1993:52). I have adjusted the notation to harmonise it with the rest of this chapter.

This principle is significantly more liberal than Theorem 2.7.11, in a manner which is highlighted by the following example. Let U be the state of affairs that a certain bag contains a nickel and a dime, one of which is drawn out, and let A be that the nickel is drawn, and B the dime. We should now consider the two contracts [$1 if A] and [$1 if B] from the point of view of U. According to the Principle above, Pr(A | U) = Pr(B | U), since U gives us no reason to suppose that either A or B is more likely to be true than the other. From this it follows that [$1 if A] ~ [$1 if B], but does Axiom 5 give the same result? It does not, as it is easy to single out either state of affairs, and so A and B are not symmetric within U. It seems very reasonable, in this example, to rule out the relations [$1 if A] > [$1 if B] and [$1 if A] < [$1 if B] between these contracts, since U gives us no reason for a definite preference of one over the other. Such a preference surely would be irrational, as it would be under-motivated.
The Principle seems then to conclude that [$1 if A] ~ [$1 if B], by elimination, but this is incorrect in my view since there is a fourth possibility, written [$1 if A] <> [$1 if B]: that the contracts are incommensurable. Nothing of interest follows from the relation [$1 if A] <> [$1 if B]; for instance, it does not follow that Pr(A | U) = Pr(B | U). The above Principle is therefore invalid, as lack of grounds for supposing the probabilities to be different is insufficient to make them equal.

A clearer example of two incommensurable contracts is as follows. U says that there are two urns, a and b, each of which contains only black balls and white balls. The proportions of black balls in each urn are unspecified, but U does say that the proportion in a lies in the interval [0.1, 0.5], and that of b lies in [0.2, 0.3]. A ball is selected from each urn. Let A be the state of affairs that the ball from a is black, and B that the ball from b is black. We are able to distinguish between A and B within K_U, as U gives different information about the two urns, so A and B are not symmetric. It also seems clear, however, that there is no reason to prefer the contract [$1 if A] over [$1 if B], or vice-versa; these contracts are therefore incommensurable.²⁶

We have shown that symmetry is an equivalence relation, which means that granting equi-probability on the basis of symmetry is safe from outright contradiction. The same is not true for giving equi-probability to any pair of propositions A and B such that [$1 if A] <> [$1 if B], since <> is not an equivalence relation. In particular the relation is not transitive, as a slight addition to the urn example shows. Let U also say that urn c has a proportion of black balls in [0.4, 0.45], and that a ball is selected from c, with C saying that it is black. Then there is no reason to prefer [$1 if C] over [$1 if A], and no reason to prefer [$1 if A] over [$1 if B], but [$1 if C] is definitely preferable to [$1 if B].
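The intransitivity of <> in the three-urn example can be made vivid with a small sketch (my own formalisation, not the thesis's: it compares contracts by the intervals in which their values must lie, preferring one only when its whole interval lies above the other's):

```python
# Interval comparison of the urn contracts: '>' only when one interval
# lies wholly above the other; overlapping intervals are incommensurable.
# The relation '<>' is visibly not transitive.

def compare(x, y):
    """Return '>', '<', or '<>' for value-intervals x and y."""
    if x[0] > y[1]:
        return ">"
    if x[1] < y[0]:
        return "<"
    return "<>"

A = (0.1, 0.5)    # urn a: proportion of black balls in [0.1, 0.5]
B = (0.2, 0.3)    # urn b: proportion in [0.2, 0.3]
C = (0.4, 0.45)   # urn c: proportion in [0.4, 0.45]

print(compare(C, A))   # <> : no reason to prefer either
print(compare(A, B))   # <> : no reason to prefer either
print(compare(C, B))   # >  : C is definitely preferable to B
```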
It is therefore inconsistent to assert Pr(C | U) = Pr(A | U) and Pr(A | U) = Pr(B | U), as this implies that Pr(C | U) = Pr(B | U). Even though there is insufficient reason to give different probabilities to a pair of states of affairs, that does not warrant giving them the same probability, for they may not have probabilities at all.²⁷ We should not be surprised therefore if this Principle of Indifference leads to contradiction (which it does, as everyone knows). The most famous examples of such contradictions are the ones developed by Joseph Bertrand,²⁸ a selection of which we will now consider.

Consider a cube, whose edge length is given to be less than 2cm. What is the probability, given this information, that the edge length is less than 1cm? If we let the random variable L represent the edge length of the cube, then we see that the intervals (0,1] and (1,2] are of equal (Lebesgue) measure. Then, letting A represent L ∈ (0,1] and B represent L ∈ (1,2], we seemingly have no reason to regard A as more likely than B, or vice-versa, so the classical Principle of Indifference declares them equi-probable. The problem arises when we realise that the information we have, that L ∈ (0,2], may equivalently be expressed as L³ ∈ (0,8]. L³, of course, represents the volume of the cube, another perfectly acceptable physical quantity. Assigning equal probability to sets of equal measure in (0,8], however, results in B being seven times as likely as A!

²⁶ It should be noted that Pr(A | U) and Pr(B | U) do not exist, as [$1 if A] and [$1 if B] do not have precise values. For instance, [$1 if A] is clearly preferable to 10¢, and less preferable than 50¢, but it is impossible to say more than this.
²⁷ This criticism of the classical Principle of Indifference is given in Keynes (1921:42).
²⁸ See Bertrand (1889).

Another famous problem, known as Bertrand's Paradox, concerns the length of a chord.
Given merely that AB is a chord on a given circle, what is the probability that the length of AB is greater than √3 times the radius of the circle? In other words, what is the probability that AB is longer than the side of an equilateral triangle inscribed in the circle? (Call this proposition E.) Bertrand provides three incompatible solutions to this problem, each one justified by the Principle of Indifference, which are as follows.

(a) Let us suppose we are given the location of A, one end point of the chord. (By the rotational symmetry of the problem, this should not affect the probability.) We then consider the angle between AB and the tangent to the circle at A. The angle lies in [0, π], and E is true iff the angle is in [π/3, 2π/3], so Pr(E) = 1/3.

(b) Let us suppose we are given the direction of AB, so that AB is limited to a set of parallel chords. (By the rotational symmetry of the problem, this should not affect the probability.) There is a unique diameter of the circle which is perpendicular to these chords, so let us consider the point of intersection P of AB and this diameter. P determines AB, and vice-versa, but P may lie anywhere on the diameter. It is easy to show that E is true iff P is within half a radius of the centre of the circle. Assigning equal probabilities to equal segments of the diameter, we deduce that Pr(E) = 1/2.

(c) Excluding the case where AB is a diameter, the position of AB is uniquely determined by the position of its middle point M. E is true, of course, iff M is less than a half radius from the centre of the circle. The area of this region is one quarter of the area of the circle, so Pr(E) = 1/4.

These examples are devastating and unanswerable, showing decisively that the classical Principle of Indifference is invalid. There have been previous attempts to rehabilitate this Principle, however, the best of which (in my view) being that of J. M. Keynes.
Keynes's version bears many similarities to my own Axiom 5, but is firmly rejected by current scholarship. It should therefore be examined, so that its similarities to and differences from my own approach may become clear. Keynes sees the need for the Principle of Indifference to contain positive, rather than negative, criteria for equi-probability. The mere absence of grounds for assigning different probabilities is not enough; we require instead the presence of grounds for assigning equal probabilities. What positive criterion is offered? It is that "...our relevant evidence ... must be symmetrical with regard to the alternatives, and must be applicable to each in the same manner", and elsewhere, "If this relevant evidence is of the same form for both alternatives, then this Principle authorises a judgment of indifference".²⁹

This idea of symmetry, or "sameness of form", of the evidence between different possibilities is the mainstay of Keynes's idea, but there are two other, less attractive, features which require mention. First there is the notion that only the relevant evidence need be symmetric with regard to the alternatives. The idea seems to be that we take our evidence U, and then pass it through a kind of filter which removes all information which is irrelevant to the relative likelihood of A and B, to get an epistemic state U′. A and B are then equally likely relative to U just in case U′ is symmetric over A and B. Thus, in general, a judgment of indifference consists of two stages: first we deem certain data irrelevant to the issue, and second we find that what remains is symmetric between the alternatives.

The filtering out of irrelevant information makes Keynes's Principle more liberal than it would otherwise be, since the information excluded as irrelevant is often sufficient to break the symmetry. To see this, let us consider one of Keynes's own examples (1921:53-54).

²⁹ Keynes (1921:55-56). (The italics belong to the original text.)
Suppose we know that an urn contains two balls, one white and one black, one of which is drawn out. Is the selected ball equally likely to be white as black? Keynes acknowledges here that "we know of some respects in which the alternatives differ", which presumably entails that the evidence is not symmetric with regard to these alternatives, but claims that "a knowledge of these differences is not relevant". Thus, according to Keynes, since the colour difference is irrelevant, the two balls have the same probability of selection relative to this evidence.

It should be agreed, I think, that judgments of irrelevance are an important and valid part of calculating epistemic probabilities, but they do not seem to fall within the realm of pure logic. In my view, which I defend below, they involve the human faculty of le bon sens, and so rely upon some synthetic knowledge of the world. They are thus important for epistemic probability, but inadmissible in logical probability. If we consider again the irrelevance of colour in the urn example, why is it that differences of colour can be ignored? If the balls were different sizes, would that be irrelevant? What if we had a ball and a cube instead of two balls? Is shape irrelevant also? What if one of the balls is sticky? I suspect that, at some point in this list, the differences would be considered relevant, so why is colour irrelevant? The only reason I can see is that balls are usually selected from urns by a blindfolded person, who clearly is unable to distinguish colours by touch alone. With the logical probability Pr(A | U), however, we consider the epistemic state consisting only of knowledge of U. There is no additional "background" information concerning the customary ways of drawing balls from urns. Moreover, if this is what Keynes has in mind,³⁰ then it seems that he is confusing two kinds of independence.
We are here dealing with what is called logical independence, or irrelevance, whereas for a blindfolded selector the colour is causally irrelevant. Now, it is shown in Chapter 5 that causal independence gives rise to independence in the physical chance function, but this is very different from independence in the logical probability function. Finally, it seems that to filter out "irrelevant" information takes us back towards the classical Principle of Indifference, the very last thing Keynes wants to do. For the question of what counts as a reason to prefer A over B is rather similar to the question of what counts as relevant information in considering the symmetry between A and B.

³⁰ His reference (p. 54) to the case where the balls are drawn by a magnet, and the balls are made of different metals, suggests strongly that it is.

The second unattractive feature of Keynes's Principle is that he adds an extra necessary condition for equi-probability: that the alternatives be indivisible. The extra condition is motivated by cases where it seems that the relevant information is symmetric over A and B, but that it is also symmetric over A₁, A₂ and B, where A = A₁ ∨ A₂ and A₁ ⇒ ¬A₂. If symmetry alone is sufficient for equi-probability, and A and B are exhaustive, then the symmetry of A and B yields Pr(A | U) = 1/2, but the symmetry of A₁, A₂ and B gives Pr(A | U) = 2/3. The condition of indivisibility forbids the declaration of equi-probability in precisely this sort of case, where one or more of the symmetric alternatives is divisible into sub-alternatives which are symmetric with the rest of the original set. If such cases are indeed possible, where sub-alternatives are symmetric with the rest of the original set, then clearly the symmetry condition alone will not suffice. To patch up the Principle with additional conditions seems ill-advised, however.
For, if such examples exist, then it follows that symmetry (in Keynes's sense, whatever that is exactly) is not an equivalence relation. In the case above we have A₁ symmetric to B, and B symmetric to A, but A₁ is not symmetric to A, so that the relation is not transitive. Now this seems plain wrong, particularly if we think of symmetry as "sameness of form". If U has the same form for A and B, and for B and C, then it must also have the same form for A and C, since identity is transitive. Any relation which can be thought of as "sameness of f", so that a is related to b just in case fa = fb, must be an equivalence relation. If it turns out that symmetry on some account is not an equivalence relation, then that account is wrong and needs to be corrected. To cover up the problem with ad hoc tinkering is unacceptable.

Keynes has very little to say about the relation of symmetry itself, and I myself cannot see how sub-alternatives might be symmetric with some members of the original set. It is provably impossible within Definition 2.7.8, as is shown by Theorem 2.7.5. Keynes's revised Principle has been attacked on account of the arbitrary nature of its indivisibility condition,³¹ as well as for its perceived failure to avoid the problems raised by Bertrand. Since Keynes has almost nothing to say about the relation of symmetry itself, I am not sure whether or not his principle escapes the Bertrand paradoxes. From my point of view, however, the question is of little importance, as I have shown that there are important differences between Keynes's Principle and my Axiom 5. The differences, in summary, are as follows.

(i) Axiom 5 is based on a highly informative definition of symmetry (Defs. 2.7.3 and 2.7.8), whereas Keynes's Principle involves no discussion of symmetry at all.

(ii) Axiom 5, unlike Keynes's Principle, does not allow information in U to be dismissed as irrelevant.

(iii) The symmetry of Def.
2.7.8 is shown to be an equivalence relation, which means that no condition of indivisibility is required.

What needs to be shown is that Axiom 5 does not give rise to paradox in the sets of circumstances concocted by Bertrand. To this I now turn.

First there was the example of the cube, for which we can consider either its edge-length L or its volume L³. Given that L ∈ (0,2], what is the probability of A, that L ∈ (0,1]? The first question is: is A symmetric to B, where B says that L ∈ (1,2]? If we consider an arbitrary world that entails A, and also one that entails B, we find that the cube in the world for A always fits inside that of B. It is therefore easy to pick out the state of affairs A from among {A, B}. In a similar way, equal-length intervals in the range of L³ are not symmetric either. There is therefore no contradiction here.

Now let us look at Bertrand's paradox. Which, if any, of the three solutions is correct? It turns out that none of them is, although (b) is the correct solution to a very similar problem. Solution (a) involves partitioning an angle of π into three equal sectors, but are these symmetric? The outer two, [0, π/3] and [2π/3, π], actually are symmetric, as becomes clear if we consider the reflection about the diameter which includes the chord-end A. The middle sector, (π/3, 2π/3), is however not symmetric with either of the other two. It can be singled out, for instance, as the one in the middle. Thus the first argument fails.

Argument (b) considers a set of possible positions for AB which are all parallel, and perpendicularly bisected by a diameter D. The position of AB is therefore determined by its point P of intersection with D. The argument assumed that equal-length segments of D are equally likely to contain P, but are these states of affairs symmetric? Some pairs are, but most are not.

³¹ See Howson and Urbach (1993:61-62).
For instance, if we represent D by the interval [-r, r], then [-r, 0) and (0, r] are indeed symmetric; but [-r, -r/2] and [0, r/2] are not, as the former is closer to the circumference, and can be singled out as such from among the pair. Roughly speaking, a segment near the circumference is not symmetric to one which is close to the centre. The second argument fails. Argument (c) considers the midpoint M of AB, which (except in the case where AB is a diameter) determines the position of AB. It assumes that equal areas inside the circle are equally likely to contain M. It is true that some such pairs of regions are symmetric, but most are not, on account of their differing distances to the circumference. This argument also fails. One may feel intuitively that argument (b) is somehow better than the other two. (If so, this is likely to be encouraged by the fact that, if one drops straws from a great height onto a floor with a circle drawn on it, then about one half of the chords produced are longer than the side of an inscribed equilateral triangle, in accordance with solution (b).) We can make sense of this intuition, using the symmetry axiom. In the physical experiment where the straws are dropped, there is presumably no causal interaction between the falling straws and the circle below; thus the final positions of the straws are causally independent of the circle's position. As I mentioned before, there is a link between causal and probabilistic independence (demonstrated in Chapter 5), so we may wonder what solution may be obtained if we assume that the circle's position is irrelevant to the position of the full line AB. Under this assumption we find that solutions (a) and (c) are still invalid, but something similar to (b) is valid. This will now be demonstrated. The chord AB is part of an infinite line; the chord AB determines the line AB, and the line AB together with the circle C determine the line segment (i.e. chord) AB.
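The straw-dropping experiment just mentioned is easy to simulate. The sketch below is a hypothetical Monte Carlo illustration, not part of the argument: a "straw" is modelled as a line through a random point in a large square, with a random direction, and we keep only those straws that cross a unit circle. For a unit circle, the side of an inscribed equilateral triangle has length √3.

```python
import math
import random

def straw_chord_fraction(trials, seed=0):
    """Drop random lines ('straws') on a plane containing a unit circle and
    return the fraction of resulting chords longer than the side of an
    inscribed equilateral triangle (length sqrt(3) for a unit circle)."""
    rng = random.Random(seed)
    longer = chords = 0
    while chords < trials:
        # A straw: a random point in a large square plus a random direction.
        x, y = rng.uniform(-10, 10), rng.uniform(-10, 10)
        theta = rng.uniform(0, math.pi)
        # Perpendicular distance from the circle's centre to the line.
        d = abs(x * math.sin(theta) - y * math.cos(theta))
        if d <= 1.0:  # the straw actually crosses the circle
            chords += 1
            if 2 * math.sqrt(1 - d * d) > math.sqrt(3):  # chord-length test
                longer += 1
    return longer / chords

print(straw_chord_fraction(100_000))  # close to 1/2, as solution (b) predicts
```

The fraction comes out close to 1/2 because, under this sampling scheme, the perpendicular distance from centre to chord is (approximately) uniformly distributed, which is just the assumption of argument (b).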
Let us then consider the line AB for now. The state of affairs U does not say anything about AB except that it intersects C, but let us consider U′ which also specifies the direction of AB. Thus, within KU′, all the possible positions for AB are parallel to one another. Let us define the rectangle lm as the subset of these parallel lines which lie between two of them, l and m. U does not specify the diameter of C, but let us suppose that U′ gives the diameter as 5cm. Then precisely one of the rectangles of width 5cm contains C. Since we have no information about the position of C, any two propositions specifying different 5cm rectangles in which C lies are symmetric. Also, for any rectangle R of width 5cm, the various sub-rectangles of equal width inside it are symmetric within U′, as each may be mapped to the others under translation, which leaves the plane as a whole unchanged. Now suppose we are told that a particular 5cm rectangle, R*, is the one which contains C. Its sub-rectangles now lose their symmetry with one another, as we can use C to pick out some over others. Now we apply our (extra-logical) assumption that the position of C is irrelevant to the position of the full line AB. This means that we cannot use C to single out any equal-width sub-rectangle in R* from among a pair of them, so they retain their symmetry. Then states of affairs of the form "the full line AB is in rectangle ai", where the ai are equal-width sub-rectangles of R*, are symmetric within U′. Now, even though we do not know where C lies within R*, the line AB still determines the length of the chord AB. The original reasoning of (b) can now be used to show that the length exceeds that of a side of an inscribed equilateral triangle with probability 1/2. We are not done yet, as we have only shown that Pr(E | U′) = 1/2, and so Pr(E | U) is still unknown. U′ contains two extra pieces of information over U, namely the diameter of C and the direction of AB.
It is clear from the argument above, however, that the particular values of these two data do not affect the result, that the probability is 1/2. Then, since KU′ is an authority for KU, it follows that Pr(E | U) = Pr(E | U′), by the Authority of Logic principle. If this were a thesis on logical probability then it would be necessary to illustrate my theory of Pr with many more examples. In view of the modest aim of this chapter, however, which is merely to show that a revised notion of logical probability is viable, this would be excessive.

3. Physical Chance

In this chapter I shall use the theory of logical probability developed in Chapter 2 to give an account of objective chance. There is nothing new about the basic idea, which is similar to a previous proposal¹, but there are some innovations in matters of detail. What do we mean by physical chance? There are some probability statements which apparently just ascribe some physical property to a system. For instance, the half-life of a radioactive isotope may be defined as the length of time in which the probability of a given nucleus decaying is one half. Now the half-life of an isotope is a physical property, measurable experimentally and so on, which has nothing to do with anyone's state of knowledge. This kind of probability, which depends entirely on the physical properties of the system concerned, is called physical chance, objective probability, physical probability, objective chance, or just chance. The chance of a state of affairs A will be written P(A).

3.1 The Definition of Chance

Like Lewis (1980:109), I think that Miller's Principle² shows us how to proceed in giving a theory of chance. Miller's Principle, in its simplest form, states that the epistemic probability of some event A, given only that its physical chance is p, is also p. Thus we have

Miller's Principle If one knows that the chance of A is p, then this authorises a degree of belief p in A.

¹See for instance Lewis (1980).
²Unfortunately there is no sensible name for this principle. "The Principal Principle" is obviously silly, and van Fraassen's term, "Miller's Principle", is also inappropriate because David Miller rejected the principle. I use the latter term simply because it is less of a mouthful.

I write "in its simplest form", as the principle thus stated has counter-examples. Such cases arise when one's epistemic state K has too much extra knowledge about A, in addition to knowing that P(A)=p. For instance, if K includes knowledge of A itself (or not-A) then the principle fails. This complication is discussed in §3.3 below. The principle is important as it provides a link between the two kinds of probability, physical and epistemic. It shows us that the two kinds cannot be fundamentally distinct; one must be a special case of the other, or both are linked to some third, more basic, kind. The situation is analogous to the fact that, in Newtonian mechanics, the gravitational mass of a body is exactly proportional to its inertial mass. For Newton this was a brute fact which had no explanation, as the concepts of inertial and gravitational mass are totally different. The fact that they are proportional, however, shows that in fact they must be two manifestations of the same property, and so it turns out. In general relativity, only gravitational mass is required, and we can see how it determines a body's resistance to acceleration relative to the "fixed stars". If chance and epistemic probability are connected, then what is the link? Either one of them can be reduced to the other, or they are linked via some third thing. Epistemic probability cannot be a special case of chance, for it depends on the knowledge one possesses. However, it seems equally clear that chance cannot be a special case of epistemic probability, for the latter is very much bound up with human brain structure.
The first constraint on physical chance, laid down in Chapter 1, was that chances depend only on the physical properties of the system in question. The third alternative is that epistemic and physical probability are linked via their being related to some third thing; this is the solution I propose. The third thing is logical probability. It is reasonable for chance to be a special case of logical probability, since the latter is independent of all contingent matters of fact. Thus, if the information given to the logical probability function is entirely concerned with physical properties of the system in question, then the logical probability of any event in this system, given those facts, is determined by those properties alone. Moreover, there is a link between logical and epistemic probability, namely the Authority of Logic principle. We have seen that the logical probability function is a function of two arguments, whereas chance is a function of one argument. So, if we are to reduce chance to logical probability, we need to find something to fill the second argument place. Our task is to discover the state of affairs, U say, such that P(A) = Pr(A | U). It would, as stated above, be quite intolerable if the chance function depended on human beings in some way. It could not then be described as "physical", or "objective". Thus the choice of U cannot depend on what any human knows about the system. On the contrary, U must be fixed by only physical facts about the system. As a first guess, we might say that U contains complete knowledge of the system in question, as chance should not depend upon ignorance in the way epistemic probability does. This clearly will not do, however, as A is also a state of affairs concerning the system, and so either A or ¬A will be a consequence of U, making all chances either 0 or 1. Clearly, we cannot tell the logical probability function everything, so what information do we give it?
As stated in Chapter 1, the answer is to supply the logical probability function with a maximal specification of the causes of the actual history of the system, of which there are two. First there is the dynamical nature of the system, represented by the generalised lagrangian ℓX, and second there is the boundary condition bcX. The boundary condition is a constraint on the motion of the system that is independent of the system's dynamical nature. It may not be clear that the boundary condition for a system is a cause of its motion, or actual history. Indeed, although the boundary condition is a constraint of some kind, it is sometimes considered to be an epistemic constraint, rather than a physical one. In other words, the boundary condition may be viewed as a piece of information about the system, that constrains what we may believe about the motion, but does not physically constrain the motion itself. This attitude toward the boundary condition is strongly encouraged by the fact that, for a deterministic system, the possible boundary conditions are grouped into equivalence classes, where the members of an equivalence class all generate the same actual history. Given this fact, how can it be maintained that just one of the boundary conditions is the real one, the cause of the motion, whereas the others in that equivalence class are not? The full answer to this question must wait until §4.6, where the matter is discussed in detail. The basic idea is that the grouping of possible boundary conditions into equivalence classes is a special feature of deterministic systems, that does not occur for any stochastic system. In a stochastic system, distinct boundary conditions give rise to different chance functions, and hence (probably) to different motions. In that section, certain physical phenomena are explained by a hypothesis about the nature of the boundary condition.
In other words, the boundary condition is postulated as a cause of the motion, and so it is necessary to regard the boundary condition as a physical constraint. The causal theory of chance is then as follows:

3.1.1 Definition (The causal theory of chance) If ℓX represents the dynamical nature of a system X, and bcX its boundary condition, then PX(A) = Pr(A | ℓX & bcX).

In the general, stochastic case, as well as the deterministic case, ℓX does not contain any modal facts, but is just an ordinary physical state of affairs. As we saw in the previous chapter, the logical probability function does not need to be given probabilities in order to yield probabilities. In other words, we may say that probability enters the physical world not by there being probabilistic physical facts, but by there being relations of partial entailment between ordinary physical facts. If this were not so, and the state of affairs ℓX made probabilistic assertions, then we would not have any theory of chance at all, as the analysis would be circular. The chance function thus defined will satisfy the axioms of probability, as it is just the logical probability function supplied with particular information.

3.2 Chance is Relativised to a System

It should be noted that, in Definition 3.1.1, the chance of an event A is defined for a particular system X in which A occurs. On this view, therefore, there is no such thing as the chance of A simpliciter; one must always specify a reference system, one which includes the event A.³ If we assume (as I do in the next chapter) that the composition of two systems is also a system, then every event occurs in many different systems. If an event A occurs within X, and X is a subsystem of Z, then A also occurs within Z. As an illustration of the fact that chances are system dependent, consider a system X that consists of a fair, six-sided die together with a device that rolls the die whenever a button is pushed.
The button is pushed on some occasion, let us suppose, causing some outcome A, such as a '4' on the die. Since the die is fair, it seems clear that the chance of this outcome within X is 1/6. Now let us consider a larger system Z, however, which includes X as a subsystem. The button in X was pushed by some part of Z, another chancy contraption that lies outside X. If the chance (within Z) of the button being pushed at all is only 1/2, then it appears that the chance of A within Z is only 1/12, not 1/6. We thus write PX(A) = 1/6, and PZ(A) = 1/12, to mark this difference. The fact that chances are defined only for a particular reference system is related to the idea that the chance of a single event can change with time. Consider, for instance, a particle undergoing a random walk within a maze, and let the event A be that the particle emerges from the exit of the maze before some given time, say noon. It may be that at eleven o'clock, perhaps, the particle is only a few steps away from the exit, and so at this time the chance of A is quite high.

³I am grateful to Paul Bartha for pointing out this aspect of the causal theory.

Unfortunately, by 11.30am the particle has moved back toward the centre of the maze, so that its chance of escape before noon is now much diminished. As is shown in §4.6, this time-dependence of chance can be accounted for under the assumption that a time slice⁴ of a system is itself a system, possessing its own lagrangian and boundary condition. The chance of an event A at time t within X can then be defined as the chance of A within some time slice [t, τ] of X, where τ is any time after A has finished.

3.3 Lewis's Objections

David Lewis (1980) does not propose a definite analysis of chance, but presents a range of possible approaches which includes my own.
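Before examining Lewis's discussion in detail, the system-relativity of chance illustrated by the die-and-button example of §3.2 can be checked with a small simulation. This is a hypothetical sketch, for illustration only; the chances 1/2 and 1/6 are taken from the example.

```python
import random

def simulate(trials, seed=1):
    """Estimate the chance of outcome A (a '4' on the die) within the larger
    system Z, where the button is pushed with chance 1/2, and within X,
    i.e. conditional on the button actually being pushed."""
    rng = random.Random(seed)
    pushes = fours = 0
    for _ in range(trials):
        if rng.random() < 0.5:          # Z's contraption pushes the button
            pushes += 1
            if rng.randint(1, 6) == 4:  # the fair die is rolled
                fours += 1
    return fours / trials, fours / pushes  # estimates of PZ(A), PX(A)

pz, px = simulate(200_000)
print(round(pz, 3), round(px, 3))  # near 1/12 (about 0.083) and 1/6 (about 0.167)
```

The same sequence of outcomes yields two different relative frequencies, depending on whether the reference class is all trials of Z or only those in which X is activated, mirroring the distinction between PZ(A) and PX(A).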
By a rather circuitous route, which I shall not retrace here, Lewis arrives (1980:97) at the following statement which he considers to be a form of Miller's Principle: Ptw(A) = PK(A | Htw & Tw). The subscripts will require some explanation. The chance function is written Ptw rather than P, as Lewis considers chance to be both time dependent and world dependent. These complications do not concern us here. The proposition Htw is a complete description of the history of world w up until time t, and Tw is a complete "theory of chance" for the world w. A theory of chance is a conjunction of conditionals from history to chance, i.e. a specification of what the chances of events would be for each possible initial segment of w's history. Lewis then considers the suggestion that this formulation of Miller's Principle could serve as an analysis of chance. Clearly, such an analysis would bear strong similarities to the causal theory proposed above in Def. 3.1.1, so let us look at the difficulties Lewis raises against this idea.

⁴A time slice of a system X is, roughly speaking, the part of X that exists in some time interval [t1, t2].

He sees two separate problems. First, the appeal to PK, which Lewis calls a "reasonable initial credence function", will be illegitimate unless we can give an account of it which does not itself appeal to chance. Lewis believes that such an account will probably require symmetry constraints in order to be sufficiently restrictive about what counts as 'rational', but these constraints are associated with well-known problems. He points out, for instance, "...it is not possible to obey too many different restricted principles of indifference at once and it is hard to give good reasons to prefer some over their competitors" (1980:111). I have argued in Chapter 2, however, that there is a symmetry principle that seems to be a part of logic, and free of paradox, so I shall not discuss this objection again here.
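The cube example of the previous section gives a concrete instance of the clash Lewis describes: indifference over edge-length and indifference over volume cannot both be obeyed, since they assign different probabilities to the same state of affairs. A sketch of the arithmetic (hypothetical code, for illustration only):

```python
import random

rng = random.Random(2)
N = 100_000

# Indifference over edge-length: L uniform on (0, 2]; P(L <= 1) should be 1/2.
p_length = sum(rng.uniform(0, 2) <= 1 for _ in range(N)) / N

# Indifference over volume: L**3 uniform on (0, 8]; L <= 1 just in case the
# volume is <= 1, so P(L <= 1) should be 1/8 on this assignment.
p_volume = sum(rng.uniform(0, 8) <= 1 for _ in range(N)) / N

print(round(p_length, 2), round(p_volume, 2))  # about 0.5 versus about 0.12
```

The two "restricted principles of indifference" give 1/2 and 1/8 for the very same proposition, which is why an unconstrained appeal to indifference cannot ground PK.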
The second problem is that such an analysis runs the risk of circularity, on account of the use of Tw. For this term, which appears in the analysans, itself involves the concept of chance. In order to avoid this objection, Lewis notes (1980:111-112), we must replace Tw with a "Humean" statement - one which only describes matters of particular fact, and not anything modal. The only possible way to accomplish this, Lewis believes (correctly, I think), is if each world has the same theory of chance. This would make the theory of chance necessary rather than contingent.⁵ According to the causal theory of chance, the history-to-chance conditionals are provided by the dynamical properties of the system, together with logical probability. The dynamical properties ℓX are independent of the actual history of X - in Lewis's terms they are the same in every nomically possible world. Moreover, the same relations of logical inference hold in all worlds. Thus, my causal account of chance is one according to which the "theory of chance" is necessary.⁶ In his discussion, Lewis does not refer to intrinsic dynamical properties of systems, such as geometrical shape, viscosity, elastic constants, masses, charges and so forth,

⁵Of course chances themselves would still be contingent, as they depend upon the history of the world as well as on the theory of chance.

⁶Note that the history-to-chance conditionals are not exactly of the form: "if the initial segment of the actual history is Ht then the chance of E is p". Rather, they are like: "given that the initial segment of the actual history is Ht (and the dynamics) one can infer E to degree p". The difference is not too important, however.
so it is not clear whether he is willing to countenance such things.⁷ Their importance to my theory is highlighted by Lewis when he points out that (1980:113)

...according to [the idea that the theory of chance is necessary] if I were perfectly reasonable and knew all about the course of history up to now (no matter what that course of history actually is, and no matter what time is now) then there would be only one credence function I could have. Any other would be unreasonable. It is not very easy to believe that the requirements of reason leave so little leeway as that.

Not very easy indeed! This is surely a huge understatement, as (from my point of view) it makes the dynamical properties of a system logically necessary! I am not sure what it would even mean to assert that this pendulum has a string-length of 28.5cm by logical necessity. If one grants the existence of ℓX, however, then one's credence is not stretched nearly so far.

3.4 A Proof of Miller's Principle

Since the causal theory of chance was motivated by the desire to account for Miller's Principle, to make it not a mere coincidence, we should ensure that the principle holds for it. Consider some event A, about which the only relevant information you have is that the chance of A is p. Given only this knowledge, is the epistemic probability of A also p? I shall now show that it is, using two premises. First I need a screening-off assumption, and second the Authority of Logic principle of §2.2.3. If we let U be the state of affairs ℓX & bcX, where X is any system in which A occurs, then the screening-off assumption is that KU is an authority for one's own epistemic state. In other words (since KU is obviously veridical) we assume that KU is superior to one's own state. This

⁷I suspect that he would not, although from the point of view of an applied mathematician this seems untenable.
assumption will not always be true, of course, but then (as noted in §3.1) Miller's Principle has counter-examples as well. The derivation of Miller's Principle will be adequate, therefore, if these two sets of counter-examples coincide exactly. That is, we require that Miller's Principle hold in all and only those cases where the knowledge in K is screened off by knowledge of ℓX & bcX.

3.4.1 Theorem (Miller's Principle) Let U = ℓX & bcX. Then, if KU > K, then PK(A | PX(A)=p) = p.

Proof By the causal theory of chance, PX(A)=p means that Pr(A | U) = p. Then, supposing that KU > K, we have that KU is an authority for K, so that PK(A | Pr(A | U)=p) = p. Thus PK(A | PX(A)=p) = p, as required. •

We see that, since chance is defined using logical probability, it is also authoritative in the same way. One might therefore call Miller's Principle the Authority of Chance principle. It should be noted that this conditional form of the principle answers the previously difficult question⁸ of which epistemic states Miller's Principle applies to. Any proposition is admissible, in Lewis's sense, provided it is screened off by ℓX & bcX.

3.5 The Objections of Howson and Urbach

Howson and Urbach (1993: 342-344) criticise a theory of chance which they simply call 'the theory of objective chance'. According to this theory, a stochastic system (such as a coin, for instance, in a coin-tossing experiment) has a certain chance of producing each of its possible outcomes.

⁸Lewis, for instance, offers no criterion to pick out "admissible" propositions, i.e. those propositions such that knowledge of them does not render Miller's Principle inapplicable.

Thus, for instance, there is a certain chance that the coin will land heads, and this number attaches to a single trial, rather than to a collective of outcomes.
The problem of showing that the chance function is a measure is solved by postulating that Miller's Principle holds, so that "chances license numerically equivalent degrees of belief". By tying chances to coherent degrees of belief, which (according to a plethora of arguments) must obey the axioms of probability, it follows that chance must be a probability function as well. Howson and Urbach are not critical of the theory as described so far, but rather of Levi's additional claim (Levi, 1980: 258) that no characterisation of chances independently of Miller's Principle is either possible or necessary. Chances are left then as mysterious, postulated entities which justify corresponding degrees of belief. As they point out, this is rather unsatisfying and we should try to do better. Howson and Urbach also attack Lewis on the same grounds, although less justly, since Lewis does not actually offer any theory of chance, and hence does not offer this theory. Lewis does however place the following constraint on any proposed theory: "I would only ask that no such analysis [of chance in terms of epistemic probability or relative frequency] be accepted unless it is compatible with the Principal Principle" (1980:90). As a necessary condition for a theory to be acceptable this is beyond reproach, but as a sufficient condition ("I would only ask...") it is much too weak, and does leave him open to a related objection. For if an analysis of chance is merely compatible with Miller's Principle, and does not entail it, then the latter is left unexplained. As I argued in §3.1, this would be like the unexplained numerical equality of inertial and gravitational mass in Newtonian mechanics. An adequate theory of chance should explain why, in certain cases, chances and epistemic probabilities are numerically equivalent.
The causal theory of chance faces no objection along these lines since, as is shown in §3.4, Miller's Principle may be derived from Definition 3.1.1.

3.6 Chance and Relative Frequency

If there is such a quantity as physical chance, then it is surely connected in some way with relative frequency. More precisely, for a given experiment that has A as one of its possible outcomes, there is surely some link between P(A) and the relative frequency of outcomes similar to A in a large class of similar experiments. In this section it will be shown that, according to the causal theory of chance, there are two links between chance and relative frequency. The first link between chance and frequency is epistemological. The chance of an event may be inferred, albeit approximately and fallibly, by measuring the relative frequency of similar events in a large class of repeated trials. Chance, in other words, is empirically accessible, in that it can be discovered (approximately and fallibly) by empirical means. The second link between chance and frequency is concerned with explanation. Chances are often cited as explanations of a certain physical phenomenon, namely the tendency of an outcome type of a given kind of trial to be associated with a stable relative frequency. In quantum mechanics, for instance, one explains the distribution of actual positions of photons detected at a screen by positing a chance distribution for each single photon. One says that the density of detected photons is greatest at this point because that is where the chance of detection is highest for each photon. To show that the causal theory of chance accounts for both of these connections between chance and frequency, we need first to define some terms. The standard experimental situation in which frequencies are discussed is when a certain type of experiment is performed many times.
We can either suppose that the same apparatus is used for each experiment, being completely reset before each new trial, or that many exactly-similar sets of apparatus exist. In the former case the experiments are well-ordered by the times at which they occur, and in the latter case we shall suppose that some other well-ordering exists. Let K be the set of possible outcome-types on each experiment, and F some field of subsets of K. Then the combined outcome of all the trials together may be represented by an outcome sequence such as s = ⟨s1, s2, ..., sn, ...⟩, where each si is a member of K. In the case where the total number of experiments is some finite number n, we can define the relative frequency of each member A of F as the number of times A occurs divided by n. Writing the relative frequency of A in s as f(A,s), we have that f(A,s) = (#A in s)/n. If the number of experiments is countably infinite, then we can still define fn(A,s) as the relative frequency of A in the first n terms of s, and then we define f(A,s) = lim n→∞ fn(A,s). This limit only exists for some outcome sequences, of course. In cases where the sequence fn(A,s) does not converge, the quantity f(A,s) is not defined. In circumstances where it is clear which outcome sequence is in question we shall write 'f(A)' instead of 'f(A,s)'. Now let us look at the issue of whether chance is empirically accessible. Suppose an experiment is performed and a certain outcome obtained, say A. If one desires to know what the chance of that outcome was, one can repeat the entire experiment ab ovo a very large number of times and record the outcome in each case. If one goes to the trouble of calculating the relative frequency of As in the first n trials, for every n, then it usually appears that the values are converging towards a limit⁹.
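This apparent convergence of fn(A,s) is easy to exhibit with simulated trials. The sketch below is hypothetical and for illustration only; the chance 0.3 assigned to the outcome A is an arbitrary choice.

```python
import random

def running_frequencies(p, checkpoints, seed=3):
    """Simulate Bernoulli trials with chance p of outcome A, and return the
    relative frequency fn(A,s) at each checkpoint n."""
    rng = random.Random(seed)
    targets = sorted(checkpoints)
    count, results = 0, []
    for n in range(1, targets[-1] + 1):
        count += rng.random() < p    # one more trial; count occurrences of A
        if n == targets[0]:
            results.append(count / n)
            targets.pop(0)
    return results

checkpoints = [10, 100, 1000, 10_000, 100_000]
for n, f in zip(checkpoints, running_frequencies(0.3, checkpoints)):
    print(n, round(f, 3))  # fn wanders at first, then settles near 0.3
```

The early values of fn fluctuate considerably, while the later ones cluster near the underlying chance, which is just the pattern described above.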
The experimenter then has a high degree of confidence that the chance of A lies in a certain interval centred on the relative frequency of A for the entire class.

⁹I say "appears", since convergence is a property only infinite sequences can have, and furthermore is a "tail event", depending only on the final (infinite) segment. Breiman (1968:40) defines a tail event as follows: "Let X1, X2, ... be any process. A set E ∈ σ(X) will be called a tail event if E ∈ σ(Xn, Xn+1, ...), for all n."

The problem is to show that this reasoning is justified, if the causal theory of chance is correct. The reader may have noticed that the concept of relative frequency does not appear in Definition 3.1.1, so one may doubt that the theory can account for the fact that chances are empirically accessible. In this section I shall show that such fears would be unfounded. The first thing to realise is that, according to generally-accepted ideas about chance, a statement about the chance of an event is not logically equivalent to any statement which is purely about frequency.¹⁰ This is quite clear for frequencies in finite sequences of outcomes, but is also true for countably infinite sequences of trials. Consider, for instance, a fair coin which is flipped ten times. The chance of heads is 1/2 on each toss, but the relative frequency of heads may or may not be 1/2. The relative frequency of heads may take any value in the list 0, 1/10, 2/10, ..., 9/10, 1. If we assume that the individual trials are pairwise probabilistically independent,¹¹ then using the axioms of probability we can calculate the chance of each possible relative frequency in this set of experiments. It turns out that the most likely frequency is 1/2, with a chance of about 0.25. Now suppose we increase the number of trials from ten to one thousand.
Again, the relative frequency of heads could be anything from 0 to 1 in steps of 1/1000, but now there is a 95% chance that the frequency will lie between 0.47 and 0.53. The general picture is that as the sequence of trials is lengthened, the chance distribution over possible frequencies tends to a Gaussian normal distribution, which becomes increasingly peaked around the physical chance. What happens then for a countably infinite sequence of experiments? In the infinite case, we are of course concerned with the limiting relative frequency. One result, known as the Strong Law of Large Numbers, states that (for our coin-tossing experiment) the limiting relative frequency almost surely exists and is equal to 1/2. If one enquires about the meaning of "almost surely", then the usual answer is that an event occurs almost surely if and only if its chance is one.

¹⁰I shall use 'frequency' as an abbreviation for 'relative frequency'.

¹¹The physics involved in this assumption is examined in detail in §4.4.

Using the definitions given in this thesis, however, this is not strictly correct. For, if the chance of this event (getting a frequency of 1/2) were one, then the contract [$1 if f(H)=1/2] would be worth exactly $1.¹² This is not the case, however, as possessing $1 dominates possessing the contract [$1 if f(H)=1/2]. That is to say, there are possible states of affairs where it is better to have $1 than the contract, but no states of affairs where it is better to have the contract. One such state of affairs is where the outcome sequence consists entirely of tails, so that the relative frequency of heads is zero. No one, as far as I know, denies that this is a possible event, and yet in this case the contract yields nothing. What then is the contract worth, if not $1? Its value cannot be any real number strictly less than one, such as 1−δ, for then it is easy to find some other contract C which is worth 1−δ/2, yet [$1 if f(H)=1/2] dominates C.
But now we are in a fix, because there are no real numbers left! We must conclude that this contract has no exact value among the real numbers. It is worth more than all the values in [0, 1), but less than $1. If an event has a chance of one, then it occurs necessarily, in the nomic sense. This event f(H)=1/2 is not nomically necessary, and does not have a chance of one, strictly speaking. Since however one is the best approximation to the chance, in some sense, we shall write P(f(H)=1/2) ≈ 1. The statement that P(H)=1/2 does not (together with the independence assumption) entail that f(H)=1/2, so it is impossible to regard the two statements as equivalent. As we shall see in Section 3.7.1, this fact is the Achilles' heel of all frequency theories of objective probability. The Strong Law of Large Numbers is an example of what I call a chance-chance statement. It is a conditional, where both the antecedent and the consequent are about chances. It may be expressed in the form: "if P(A) = p, then P(f(A)=p) ≈ 1". The conclusion so far about the link between chance and frequency is that there are no true chance-frequency statements. That is, there is no true statement of the form: "If P(A)=p, then necessarily f(A)=p", or "If P(A)=p, then if an infinite sequence of trials were performed, f(A) would (necessarily) equal p", and so on. It is impossible to deny that there is a link between chance and frequency, of a kind which allows us to learn about chances by observing frequencies, so it is perhaps surprising that there are no true chance-frequency statements. Can chance-chance statements serve instead? My view is that they can, but only if we can also make use of Miller's Principle. It is most helpful here to consider finite laws of large numbers, ones of the form: "if P(A)=p, then P(f_n(A) ∈ [p-ε, p+ε]) = 1-δ", where ε and δ depend upon n. Suppose we flip a coin 1000 times, and obtain f(H)=0.51.

[12] This is for someone who knew only ℒ&bc.
Intuitively, this should increase our confidence in the hypothesis P(H)=1/2 at the expense of hypotheses like P(H)=5/6. An inference of this kind requires Bayes's theorem, which yields

P_K(P(H)=1/2 | f(H)=0.51) = P_K(f(H)=0.51 | P(H)=1/2) × P_K(P(H)=1/2) / P_K(f(H)=0.51).

The crucial terms here are the ones like P_K(f(H)=0.51 | P(H)=1/2). It is easy to derive a statement like "P(f(H)=0.51) = p" from the premise that P(H)=1/2, and so we should like to replace 'P_K(f(H)=0.51 | P(H)=1/2)' with 'p'. If we can replace epistemic probability with chance in this way, then there is the required modification of the epistemic probability function, as the most likely chances according to P_K become those close to the observed relative frequency.[13] It must be stressed that this replacement of epistemic probabilities with chances is all we require in order to do the calculation.[14]

[13] Of course one also needs a sufficiently uniform prior epistemic probability distribution over the possible values of the chance, but this is a different problem.
[14] Thus, for instance, Lewis (1980:106-108) has the resources to carry out this calculation, even though he does not endorse any particular theory of chance.

Such a substitution of 'P' for 'P_K' is, of course, licensed by Miller's Principle, as follows. Since P(H)=1/2 entails[15] that P(f(H)=0.51) = p, we have:

P_K(f(H)=0.51 | P(H)=1/2) = P_K(f(H)=0.51 | P(H)=1/2 & P(f(H)=0.51) = p)
= P_K(f(H)=0.51 | P(f(H)=0.51) = p),

since 'P(f(H)=0.51) = p' screens off 'P(H)=1/2' here, as it contains everything in 'P(H)=1/2' relevant to P_K(f(H)=0.51). This, in turn, yields P_K(f(H)=0.51 | P(H)=1/2) = p, by Miller's Principle. Thus what is required of a theory of chance, in order for chances to be empirically accessible, is that (i) the theory entails that chances are probabilities, in the formal sense, and (ii) the theory entails Miller's Principle.
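The Bayesian updating described here can be illustrated numerically. The following sketch (my own illustration, not the author's calculation) takes the two rival hypotheses P(H)=1/2 and P(H)=5/6 with equal priors, substitutes binomial chances for the epistemic likelihoods as the text licenses, and conditions on 510 heads in 1000 tosses:

```python
from math import comb

def likelihood(k, n, p):
    """Chance of exactly k heads in n independent tosses, given P(H) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, k = 1000, 510                   # observed frequency f(H) = 0.51
hypotheses = {0.5: 0.5, 5/6: 0.5}  # equal priors over P(H)=1/2 and P(H)=5/6

evidence = sum(prior * likelihood(k, n, p) for p, prior in hypotheses.items())
posterior = {p: prior * likelihood(k, n, p) / evidence
             for p, prior in hypotheses.items()}

print(posterior[0.5])   # very close to 1
print(posterior[5/6])   # negligibly small (underflows to 0.0 in floats)
```

As expected, the posterior concentrates on the hypothesis whose chance is close to the observed relative frequency.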
Now let us look at the second problem, of showing that hypotheses about chance can be used to explain relative frequencies, at least to some extent. In §1.5.5 it was argued that to explain an event A one must infer A, to some degree, from a hypothesis about the causes of A. One must demonstrate that, given the causes of A, one should have some (fairly high) degree of belief that A occurs. Now suppose that a coin X is flipped a very large number of times, and in the course of these experiments the relative frequency of heads seems to be converging to somewhere around 0.41. According to the causal theory of chance, can one explain this datum f(H) ≈ 0.41 by means of a hypothesis like P_X(H) = 0.41 (on each trial)?

[15] Of course one must also assume independence, so we really have P_K(f(H)=0.51 | P(H)=1/2 & independence).

Using Definition 3.1.1 the statement P_X(H) = 0.41 means that the coin has some lagrangian[16] ℒ_X such that Pr(H | ℒ_X) = 0.41 (on any single trial). As we have already seen, it follows from this that Pr(f(H) ≈ 0.41 | ℒ_X) is close to one. Thus it is clear that a hypothesis about ℒ_X can be used to explain the datum f(H) ≈ 0.41, as ℒ_X is a purported cause of the datum, from which the datum may be inferred. In making a conjecture that Pr(H | ℒ_X) = 0.41, one does not say what ℒ_X actually is; nor does one actually make the partial inference from ℒ_X to H. One merely claims that some such ℒ_X exists and that the inference from ℒ_X to H (to degree 0.41) is valid. In spite of this vagueness in the hypothesis, however, the claim that P_X(H) = 0.41 is nonetheless a statement about the cause of the datum f(H) ≈ 0.41, and it does entail the datum (with near certainty). One can therefore explain an observed relative frequency using a hypothesis about a chance, although it is better to use a more direct hypothesis about ℒ_X if possible.
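The claim that, given a chance of 0.41 for heads on each trial, the chance of observing a frequency near 0.41 is close to one can be checked numerically. In this sketch the trial count n = 10,000 and the ±0.01 tolerance are my illustrative choices; the binomial terms are computed in log space to avoid floating-point underflow at this n:

```python
from math import lgamma, log, exp

def log_binom_pmf(k, n, p):
    """Log-chance of exactly k successes in n independent trials."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

n, p = 10_000, 0.41
# Chance that the observed frequency falls within 0.41 +/- 0.01.
prob_near = sum(exp(log_binom_pmf(k, n, p)) for k in range(4000, 4201))

print(round(prob_near, 3))   # close to 1 (roughly 0.96 here)
```

Widening the tolerance or increasing n pushes this chance as close to one as desired, which is the sense in which the chance hypothesis entails the frequency datum "with near certainty".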
In conclusion, we find that although the causal theory of chance does not explicitly refer to relative frequencies, it nonetheless accounts for the two ways in which chance and frequency are connected. It delivers the results that (i) chances are empirically accessible, by measuring relative frequencies, and (ii) hypotheses about chance may be used to explain observed relative frequencies.

3.7 Frequency Theories of Probability

There are other interpretations of the probability calculus in terms of purely physical properties, which may be described as frequency theories of probability. The first sophisticated formulation of this idea is due to Richard von Mises, but many others have followed in his footsteps, producing their own versions. Among these we shall consider those of Popper, Howson and Urbach, and van Fraassen.

[16] For the sake of the example I am assuming that the coin is genuinely stochastic, and that the chance of heads is the same for each toss (i.e. independent of the boundary condition).

Frequency theories are founded upon two common observations. The first is that relative frequencies are, formally speaking, probabilities.[17] In any outcome sequence s, for instance, if A and B are mutually exclusive, then f(A∨B, s) = f(A, s) + f(B, s), which is an axiom of probability. The other axioms also hold, including even the Principle of Conditioning, as we can define f(A | B) as the relative frequency of A in the sub-sequence of terms which are members of B. Second, there is a close link between physical probability and frequency, which we examined in the previous section. There are two basic kinds of frequency theory which, following van Fraassen (1980:181, 190), I shall call strict frequency theories and modal frequency theories. A strict frequency theory, such as that proposed by Reichenbach (1949), states that the objective probability of an event-type is its relative frequency in the actual sequence of outcomes s_a, i.e. P(A) is just f(A, s_a) by definition.
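The observation that relative frequencies behave formally like probabilities is easy to verify on a finite outcome sequence. A small sketch (the sequence and event labels are invented for illustration):

```python
def f(event, s):
    """Relative frequency of outcomes in `event` within sequence s."""
    return sum(1 for x in s if x in event) / len(s)

def f_cond(event, given, s):
    """f(A | B): frequency of A within the subsequence of B-outcomes."""
    sub = [x for x in s if x in given]
    return sum(1 for x in sub if x in event) / len(sub)

s = ['a', 'b', 'c', 'a', 'b', 'a', 'c', 'b']
A, B = {'a'}, {'b'}                  # mutually exclusive events

# Additivity: f(A or B, s) = f(A, s) + f(B, s) for disjoint A, B.
assert abs(f(A | B, s) - (f(A, s) + f(B, s))) < 1e-12

# Conditioning: f(A | A or B) = f(A, s) / f(A or B, s).
assert abs(f_cond(A, A | B, s) - f(A, s) / f(A | B, s)) < 1e-12
```

The same checks go through for any finite sequence, which is all the first observation requires.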
The proper response to this view seems to be as follows. It cannot be denied that relative frequencies among outcomes of actual experiments are objective quantities, and moreover that they are probabilities in the formal sense. Thus, one might well call them "objective probabilities". They are not, however, what we mean by physical chances, as was shown in §3.6. It is an essential property of chance that it does not strictly determine a unique relative frequency, but merely provides a chance distribution over the various possible relative frequencies. In short, frequency is one thing and chance is another. Reichenbach cannot claim that he provides an analysis of our common notion of chance, although he may hold that that notion is bankrupt, and that his alternative is sufficient for the needs of science. In response here I would say first that the notion of chance is not bankrupt, since the causal theory gives a good account of it. Second, we cannot do without chances, as they are so deeply involved in quantum mechanics.

[17] This is not strictly true in the case of an infinite sequence of experiments, but I shall not pursue the matter here.

The second kind of frequency theory, the modal frequency theory, is far more plausible and has many distinguished advocates. There are many different versions of it, however, which we shall consider in turn.

3.7.1 Von Mises

Von Mises's frequency theory (1928) starts with the idea of a collective. A collective is an infinite outcome sequence s with the following two properties.

(i) The Axiom of Convergence: f(A, s) exists for each A in F.
(ii) The Axiom of Randomness: If we take any subsequence s' of s, which is picked out by a 1-1 increasing recursive function, then f(A, s') also exists and is equal to f(A, s).[18]

Now let us consider some experimental set-up, and one possible outcome A for that experiment.
The objective probability of A, which we shall write P(A), exists if and only if (i) this set-up would (on an infinite sequence of experiments) produce some collective or other, and (ii) any two collectives s_1 and s_2 which might be produced are such that f(A, s_1) = f(A, s_2), for every A in F. If these two conditions are satisfied then we can define P(A) = f(A, s), where s is any collective which might be produced by the experiments. (I assume that von Mises's claim is that an experiment for which objective probabilities are defined necessarily produces some collective or other, but not necessarily any collective in particular.)

[18] This formulation is actually due to Alonzo Church (1940). It sharpens up von Mises's original idea.

It is clear that this definition is in conflict with the common-sense notion of chance, as according to von Mises it is impossible for a fair coin, i.e. one with P(H)=1/2, to land heads every time in an infinite set of tosses. The term 'P(H)' is not even defined unless the coin necessarily generates a collective. Moreover, it is hard to see how the collective <H, H, H, ...> could be impossible if, on each individual trial, the outcome H is possible for that trial. It would seem to require that some of the experiments, at least, "know" the outcomes of other experiments, even though they are supposed to be carried out independently. Worse still, it is not just the collective <H, H, H, ...> which is excluded for a process with P(H)=1/2, but an uncountably infinite set of other collectives. And then, of course, the uncountably infinite set of non-collectives is also ruled out. On the face of it at least, and according to standard probability theory, each of these is individually just as possible as any one sequence in the select group of collectives. Let us call this problem, that frequencies do not necessarily equal probabilities, even in an infinite class of trials, the cardinal problem for frequency theories.
Excluding "communication" between experimental trials, the only way to restrict the outcome sequence to a select group of collectives is by the outcomes, as a group, being determined by something or other. This idea is also extremely problematic, however, for two reasons. First, it would mean that the first toss is determined perhaps to land heads, the second heads also, the third tails, and so on, even though the experiments are supposed to be set up under maximally similar initial conditions. This is a contradiction, so we would have to say that the initial conditions vary slightly from experiment to experiment. Then, however, it is clear that these mysterious variations are faced with the same difficult task of guaranteeing that the outcome sequence is a collective with the right frequency. They cannot be "random", but must be rigged to follow some pre-set pattern. There must be something in the background, pulling the strings so to speak. Second, if the trials are all deterministic, then one is inclined to agree with Lewis (1980:117-121) that the genuine chance of heads varies from trial to trial, being either 0 or 1 in each case. The number 1/2 is then some sort of "counterfeit chance", depending upon human ignorance. It is not determined by the experimental arrangement itself. This objection (that chance set-ups would not necessarily produce collectives upon infinite repetition, let alone collectives with the right frequencies) seems to be quite fatal to all modal frequency theories, so it is surprising that it is so rarely mentioned in the literature. The only reason I can see for ignoring it is as follows. In recent discussions of the measurement problem in quantum mechanics, there is much talk of two quantities being equal FAPP, i.e.
For All Practical Purposes.[19] For instance, if a density matrix is such that it is practically impossible to distinguish it experimentally from a classical mixture, then one says that the state is a classical mixture, FAPP. To many who work on the experimental end of the subject, this seems a completely satisfactory solution to the measurement problem. The more theoretically inclined, however, feel very uneasy about it. The situation with chance seems rather similar. For any sensible person, the distinction between events which are necessary and those which occur almost surely is too fine to see. If two propositions are equivalent almost surely then they just are equivalent, FAPP. Anyone who insists on a difference here is just being perverse. Am I being perverse? I think it is important to distinguish between practical and theoretical contexts. In a practical context one can normally treat an event whose probability is as low as 0.9999 as being certain to occur, and it would indeed be perverse to express doubt about an event which occurs almost surely. In a theoretical context the situation is quite different. It is a matter of simple logic. If two propositions are equivalent, so that one can be analysed as the other, then they entail one another. Therefore, since the statement "P(H)=1/2" does not entail "if the experiment were repeated infinitely many times, we would necessarily have f(H, s)=1/2", they are not equivalent. In theoretical contexts, a miss is as good as a mile.

[19] I believe this acronym was first used by Bell (1990:33).

It should be noted that the cardinal problem is universally recognised with regard to defining objective probability as relative frequency in a finite sequence of outcomes. No one, for instance, defines P(H) as the relative frequency of heads which would exist in a sequence of 1000 tosses, since there is no single relative frequency which would necessarily occur in this experiment.
The use of an infinite experiment constitutes a 'FAPP' solution to the cardinal problem. The unwanted relative frequencies are ignored by making them very, very improbable. A more common objection is that von Mises's probabilities cannot be determined empirically. Let us call this the problem of empirical access, which is also common to all frequency theories. The problem is that the limit of a sequence depends only upon the infinite final segments, so to speak, and is entirely independent of any finite initial segment. Yet, of course, only finite initial segments are open to examination. More precisely, the convergence of the outcome sequence is what is known as a tail event,[20] i.e. the limit of the sequence f_n(A, s) is unaffected by any change in the first r terms of s, for any finite r. Thus, for any finite r, the first r terms of s tell us precisely nothing about the value of f(A, s).[21] Another way to express this objection is to consider what we might call deceitful collectives. A deceitful collective, with f(A)=1/2 say, is one which seems to be converging on some other relative frequency for (say) the first billion terms. For instance, it may be converging quite convincingly to 3/4 in the first billion terms, but then suddenly switch and converge to 1/2 thereafter. Since whether or not an outcome sequence is a collective depends only upon its limiting behaviour, such "deceitful" sequences are perfectly good collectives. We may contrast deceitful collectives with honest ones, that is, ones which begin to converge to their final limit right from the start. Now suppose we repeat a type of experiment one billion times, and obtain a relative frequency which is close to 3/4. Can we take this as very good evidence that P(H) is close to 3/4? To keep things simple we shall compare two hypotheses, that P(H)=3/4 and that P(H)=1/2.

[20] See note 13.
[21] For a detailed discussion of this objection see Howson and Urbach (1993:331-337).
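The "deceitful" behaviour at issue is easy to exhibit in a finite approximation (a sketch; a prefix of 1,000 terms stands in for the billion of the text). A sequence can converge convincingly to 3/4 over its initial segment and still have a quite different limiting frequency, since the limit ignores every finite prefix:

```python
# A "deceitful" sequence: frequency 3/4 over the first 1000 terms,
# frequency 1/2 over the long tail that follows.
prefix = [1, 1, 1, 0] * 250        # 1000 terms with frequency 3/4
tail   = [1, 0] * 500_000          # 1,000,000 terms with frequency 1/2
s = prefix + tail

def running_freq(s, n):
    """Relative frequency of heads (1s) over the first n terms."""
    return sum(s[:n]) / n

print(running_freq(s, 1000))             # 0.75 -- looks like P(H) = 3/4
print(round(running_freq(s, len(s)), 3)) # ~ 0.5 -- the tail wins
```

Extending the tail indefinitely drives the overall frequency as close to 1/2 as desired, no matter what the prefix showed.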
To be fair, we shall assume that these have the same initial (epistemic) probability. The question is now whether this evidence of one billion trials causes the hypothesis that P(H)=3/4 to become much more probable than its alternative. The crucial terms in a Bayesian evaluation of this question are P_e(f_n(H, s_a)=3/4 | P(H)=3/4) and P_e(f_n(H, s_a)=3/4 | P(H)=1/2). For long-run frequencies to be empirically accessible, the first term must be much greater than the second. Intuitively, the ratio of these two quantities depends upon the relative probabilities of two classes, the deceitful collectives, s', with f(H, s')=1/2 (which converge to 3/4 in the first billion terms), and the honest collectives, s, with f(H, s)=3/4. All of these sequences yield the observed datum. Now, since the two classes of collectives, those with f(H)=1/2 and those with f(H)=3/4, are assumed to have equal prior probability, the question is about the relative probabilities of honest and deceitful collectives within classes of fixed f(H). Von Mises requires that, among collectives with f(H)=1/2 for instance, the class of honest sequences is much more probable than the class of deceitful ones. This difference of probability cannot be a matter of cardinal power, for both classes have the power of the continuum. Moreover, since von Mises has nothing in his account to suggest any privileged measure over the set of collectives, it is hard to see where it could come from. This problem of empirical significance therefore appears to be quite fatal to von Mises's account.[22]

[22] Howson and Urbach claim to have solved this problem, however.

3.7.2 Popper

Popper's theory of objective probability (Popper, 1959) is almost the same as that of von Mises. The only difference seems to be that Popper is willing to ascribe probabilities to particular events, rather than merely to types of event. Popper, like von Mises, defines the probability of
an event type as the relative frequency of that event type in all the collectives which might be generated by an infinite repetition of the experiment. (We will discuss Howson and Urbach's "Bayesian reconstruction" of von Mises in §3.7.3 below.) Now, for Popper, these imagined repetitions are supposed to be carried out under exactly similar conditions; thus, assuming f(A, s) differs from 0 and 1, it follows that the outcomes are irreducibly indeterministic. (If they were deterministic, then the same initial conditions would invariably lead to the same outcome.) Thus the probability is determined entirely by the experimental conditions on a single trial: these conditions determine the possible collectives which might be generated upon infinite repetition, and the collectives determine the probability. For this reason, Popper describes his theory as one according to which there are objective, single-case probabilities. These are seen as the propensity for experiments of that type to produce collectives of a particular relative frequency. Popper seems to be quite correct about this. The puzzle is why von Mises and others insist that probability is irreducibly a property of the whole collective, and cannot be attached to a single event. Howson and Urbach (1993:340) do give an argument, on von Mises's behalf, as to why probabilities are not defined for a single experiment. They point out that, in a coin-tossing experiment for example, there are many causes which combine to produce the outcome in a single case. In addition to the intrinsic properties of the coin there are variations in air density, convection currents, the precise way in which the coin is released, and so on. This is quite true, of course. They then assume, however, that the outcome of a toss is uniquely determined by a set of parameters q_1, q_2, ..., q_k. This leads to a serious problem for Popper, as it means that an exact repetition of an experiment will yield the same result.
It then follows that a single experiment can only define one of the two collectives <T, T, T, ...> and <H, H, H, ...>, and so all single-case probabilities are either 0 or 1. The way to avoid this unwelcome triviality, Howson and Urbach contend, is to give up the idea of single-case probability, but it is not clear to me how one is to draw this conclusion. A more attractive response for Popper is surely to deny the assumption of determinism present in the argument. After all, any probabilities one defines for a deterministic process cannot be genuine chances. Howson and Urbach consider this response, but are wary of basing a theory of physical probability upon the metaphysical assumption of indeterminism. They explain that "Our objective is to find a theory of objective probability that will fit in with the practice of statistics" (1993:341), and of course the practice of statistics will continue regardless of whether or not the world turns out to be irreducibly stochastic. It seems to me that the view that objective probabilities are not defined for a single experiment is equivalent to saying that physical probabilities do not belong to complete descriptions of an experiment. (A single, concrete instance of an experiment type defines an exact description.) Yet if probabilities do not belong to complete descriptions of experimental conditions, then surely the only alternative is to attach them to incomplete specifications. Now, if an incomplete description is needed, then the question arises as to how incomplete it will be, and in which respects. In other words, one is faced with the old "problem of the choice of reference class" (1993:340), which Howson and Urbach are anxious to avoid. It is quite unclear to me how an appeal to the collective is supposed to help here. The collective,[23] after all, is the very thing this incomplete description is supposed to specify!
At best one might refer to some set of more than one actual experiment, but this would surely be no better than a single trial at defining the exact conditions for an infinite sequence of trials. To summarise, it seems that the only way to avoid the problem of the choice of reference class is to say that each experiment is performed with maximally similar initial conditions, as this leaves one with no choice to make. Of course, for a deterministic system this means that the probabilities defined will be trivial.

[23] More precisely, its limiting relative frequency.

3.7.3 Howson and Urbach

We now return to the problem of empirical access, which Howson and Urbach claim to have solved on behalf of von Mises. The basis of their position is a version of Miller's Principle, for which they offer a proof. Their version of Miller's Principle (1993:345), applied to the case of a coin-tossing experiment, is that P_K(H | f(H)=p) = p. In other words, if you know that a certain coin produces a collective with frequency p (w.r.t. the property heads), and not too much else that is relevant, then your epistemic probability for heads (on any particular trial) should also be p. More briefly, knowledge of the frequency licenses a numerically equivalent degree of belief. Since Colin Howson is responsible for the chapter in which this principle appears, I shall refer to it as Howson's Principle. If Howson's Principle is true, then there certainly is no problem of empirical access for von Mises. The Principle provides exactly the probability measure required, in other words, to rule out deceitful collectives as massively improbable compared to their honest competitors. It says that our beliefs about finite initial segments of outcome sequences should depend strongly upon what we suppose to happen in the limit.
This is counter-intuitive, in view of the fact that the limit of the sequence f_n(H, s) is entirely independent of the first r terms of s, in the sense that the initial segment of s can be altered at will without any effect on f(H, s). We should therefore examine the argument for Howson's Principle. (I have altered the terminology of this to fit with my Chapter 2, but the substance is the same.) Howson considers the situation where a coin is tossed a certain number of times to produce an outcome sequence s, where each term is H or T, and a person X is given an opportunity to buy the contract [$1 if s_1=H]. X is given only the information that, if the tosses were continued indefinitely, the outcomes would constitute a collective with f(H)=p. Suppose that X considers some amount $p' to be a fair price for this contract. X should then also be willing to pay $p' for the contract [$1 if s_2=H], and indeed for any contract of the form [$1 if s_r=H]. Let us then suppose X is willing, as he should be, to buy all of these contracts. In this case, after n trials in which m had the outcome heads, his net gain is m - np', i.e. n(m/n - p'). Now since the limit of m/n as n tends to infinity is p, if p' > p then there is some n_0 such that X has made a loss after n_0 trials, and this loss is never recovered thereafter. It follows therefore that $p' is too much to pay for the contract, since it leads X to a purchase which is a certain loss. By a similar argument anything less than $p would be too little, and so $p is the fair price. Thus, by the definition of epistemic probability, P_K(H | P(H)=p) = p, which is the required result. The argument seems to consider the case where X buys all the contracts [$1 if s_r=H], for every r, at the price $p' each. X then loses, and gains, infinite amounts of money. These quantities are not comparable, so it makes no sense to ask whether X loses or gains overall.
In the case where X buys all the contracts, therefore, he is not obviously irrational, and so the value p' is not obviously incorrect. But wait a minute! Howson showed, apparently contrary to this, that after some finite number n_0 of trials X "goes into the red" and never comes out! This reasoning seems to indicate that the infinite losses and gains, taken together, should be considered an overall loss. How can there be this conflict? The apparent contradiction is due to value being a cardinal quantity, whereas the limiting relative frequency depends essentially on the ordering of the outcomes. Let us suppose p = 1/2, and imagine the losses and gains to be measured in gold coins. Then each losing contract costs X one coin, and each winning contract pays out one coin. It may help to think of each coin being marked with the numeral for r, if it is concerned with the r-th trial, and suppose that each coin is only used once. After all is finished, therefore, there are two piles of coins, one of coins lost and the other of coins gained. These two piles of coins are entirely independent of how the trials are ordered. Any re-ordering of the trials may re-order the piles, but exactly the same coins, marked with the same numerals, will be present. Wealth is a cardinal quantity; it does not depend upon ordering. On the other hand, a re-ordering of the trials may have a very large effect on f(H). After re-ordering, f(H) may have any value between 0 and 1, or indeed no value at all, as the sequence f_n(H) may no longer converge. Thus, the fact that the sequence n(m/n - p') goes permanently negative at some point does not mean that X suffers an overall loss. By a well-chosen reordering of the trials, this sequence will instead go permanently positive at some finite point, but still the piles of coins are the same! Wealth cannot be measured by anything which is sensitive to ordering in this way. As I describe the situation, X buys all these contracts as a bundle, i.e.
"all at once", so to speak. In Howson's argument, however, it seems rather that X buys them "one at a time", each one just before the coin is tossed, although of course X is not permitted to see the outcomes. (If X gets to see the outcome of each bet before deciding whether to place the next, then he has more information about the outcomes than merely that they are a collective with frequency p.) I do not see, however, how this difference of presentation makes any concrete difference. In both cases X decides to buy all the contracts, and makes certain gains and losses. There remains this fact that, for some finite n_0, X will dip into the red at n_0 and never return to the black. Although this is a logical consequence of the fact that f(H)=p, i.e. it follows from a fact about the infinite tail of the outcome sequence, it looks rather like a finite consequence. Indeed, I think the initial plausibility of Howson's argument stems from this equivocation. The illusion can be improved if we suppose that all the coin tossing is done in advance, with the results kept secret, so that we are dealing with a particular collective s. There is now some definite number k (let us say 865, for definiteness) such that X's final slide into ruin begins at k. X even knows that there is such a k, although he does not know the actual value. One may feel that it must be possible to construct some irrational bet for X which is forced upon him by his being willing to pay $p' for [$1 if s_r=H]. If buying all these contracts is not irrational, as the gains and losses go to infinity and become incomparable, then perhaps buying some finite initial segment is? It does not take much thought, however, to see that this is not the case. If X buys all the contracts up to r=n, say, then whatever the value of n it could still be less than k, for all X knows. The only way to ensure that X keeps buying contracts after k is to make him buy them all.
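A finite analogue of the reordering point may help (a sketch; the 50/50 outcome multiset and the price of $1/2 are invented for illustration). The same multiset of wins and losses yields the same total gain, but whether the running balance ever dips into the red depends entirely on the ordering of the trials:

```python
# X pays p' = 1/2 per contract; each head pays out $1.
# Net gain per trial: +1/2 on heads, -1/2 on tails.
outcomes_alt = [1, 0] * 500           # heads and tails alternating
outcomes_bad = [0] * 500 + [1] * 500  # the same multiset, all tails first

def running_balance(outcomes, price=0.5):
    """Cumulative net gain after each trial."""
    bal, history = 0.0, []
    for o in outcomes:
        bal += o - price
        history.append(bal)
    return history

alt = running_balance(outcomes_alt)
bad = running_balance(outcomes_bad)

print(alt[-1], bad[-1])    # identical totals: 0.0 0.0
print(min(alt), min(bad))  # very different worst dips: 0.0 vs -250.0
```

The totals (the "piles of coins") are ordering-invariant; the sign behaviour of the partial sums is not, which is exactly the gap the text identifies in Howson's argument.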
It is quite clear that Howson's Principle cannot be justified using an argument of this type, so his attempt to rehabilitate von Mises's theory fails. Also, even if Howson did solve the problem of empirical access, the cardinal problem of frequency theories would remain.

3.7.4 Van Fraassen

Van Fraassen's frequency theory (1980:190-196) is essentially the same as that of von Mises and Popper,[24] although there are some improvements in matters of detail. He defines a good family as a class of collectives with certain properties. These restrictions on the class of collectives used in the definition of probability mean that the probability function defined is a countably additive measure. These refinements do not, of course, have any impact on the two problems which face frequency theories, so let us look at what van Fraassen has to say about them. Van Fraassen does not mention the cardinal problem for frequency theories. His concern is to define a set of collectives which generate relative frequencies with the right formal properties. He appears not to notice that the outcome sequences excluded from the "good family" are all quite possible, in any reasonable sense. It is hard to see what he might be thinking. There is, however, an attempt to deal with the problem of empirical access (1980:194-196). Van Fraassen's basic idea is to regard a finite sequence of actual outcomes as a "random sample" from one of the collectives in the good family. He likens this random sampling to the random selection of marbles from a finite barrel where, as is well known, the proportion of marbles with a given property A determines a sampling distribution for A. Thus, examination of a finite set of randomly-selected marbles gives useful information about the proportion of A in the barrel.

[24] At least, as I read Popper. Van Fraassen somehow interprets Popper as holding that there is a single virtual sequence which would arise in an infinite sequence of trials.
This interpretation makes Popper inconsistent, of course, as he is also committed to indeterminism. Van Fraassen seems willing to consider cases where the actual set of experiments is infinite, even though we only know the outcomes for a finite number of them. It seems like a good idea to discuss his response to the problem of empirical access in this context, as it means we are dealing with a particular collective s. For definiteness, let us suppose that we are dealing with tosses of a fair coin, so that f(H,s) = 1/2. Van Fraassen then regards the outcome of a single toss as a random selection from s. Since, in a manner of speaking, one half of the terms in s are heads, it seems that the probability of heads on a single toss is also one half. What kind of probability is this? Is it physical or epistemic? It surely cannot be physical, for van Fraassen has already given an account of physical probability, and it did not involve the notion of random selection. So it must be epistemic probability; we have something rather like Howson's Principle, in that the epistemic probability of heads (presumably given that f(H,s) = 1/2) is one half on each toss. We found that Howson's argument for his principle was fallacious, so we should see if van Fraassen has anything better. Unfortunately there is no argument provided. Instead van Fraassen merely tries to convey "...an intuitive idea of how the theoretically predicted frequencies are derived" (1980:195). The intuitive idea is that we regard the collective s as a barrel of marbles. Now, everyone agrees that if a barrel of marbles has a known relative frequency of 1/2 black marbles, then the epistemic probability of an arbitrary marble being black is also 1/2. Thus, it seems, the epistemic probability that an arbitrary term in s is H is then just the relative frequency of H in s. We should note first that the relative frequency of s depends essentially on the ordering of terms, as mentioned before. 
The usual idea of a "random element" of a set S, in the epistemic sense, is of an object whose only known property is that it is a member of S. Thus, if a random term in s has a probability 1/2 of being heads, then the ordering of s must be given. If this is so, however, then it is quite unclear how it can be shown that Pe(st=H | f(H,s) = 1/2) = 1/2. In the case of a finite barrel of marbles, one uses the symmetry between the marbles in the barrel to argue that each marble has the same epistemic probability of being the selected one; the result then follows by the addition axiom of probability. If the ordering of s is known, however, as it must be for a relative frequency to be defined, then this destroys the symmetry between the terms. I have no idea how Howson's Principle can be derived along these lines, therefore. Judging from his cheerful talk of "standard statistical calculation", such as the χ2 test, van Fraassen seems unaware of the theoretical difficulties involved here. 3.7.5 Conclusion It is perhaps surprising that frequency theories have such difficulty getting the right link between probability and frequency, since probability on these accounts is defined as frequency! On these accounts the connection between probability and frequency is both too tight and too loose at the same time - too tight for infinite outcome sequences, and too loose for finite ones. It is too tight in the infinite case, since it makes limiting relative frequencies necessarily equal to chances, which is overly strong. This is the cardinal problem. It is too loose in the finite case, as for a finite outcome sequence s the limiting relative frequency is irrelevant, and it does not provide an epistemic probability for s. This is the problem of empirical access. Howson perceives that to solve the problem of empirical access one must be able to derive epistemic probabilities over finite sequences of actual outcomes from knowledge of the limiting relative frequency. 
Unfortunately, however, the principle required is very hard to justify, due to convergence being a tail event as well as sensitive to ordering. Limiting relative frequencies in infinite outcome sequences are used instead of real frequencies in finite outcome sequences as a 'FAPP' solution to the cardinal problem, but even if this were satisfactory in itself (which it is not) it creates another problem which is just as severe. The relative frequency is then pushed infinitely far away, out to the limit, making it empirically irrelevant. 3.8 Conditional Chances Since, according to the causal theory, the chance function is just the logical probability function, given certain information, it may be conditioned in the usual way. That is to say, conditioning has its usual meaning, in terms of updating beliefs in light of new information, and so the Principle of Conditioning is satisfied. In general we have: PX(A | B) = Pr(A | £x & bcx & B). Intuitively it seems that conditional chances ought to be meaningful. This definition validates such reasonable-sounding sentences as: "the chance that the gun will fire, given that the trigger is pulled, is 0.97", since we can interpret "given" in the usual, epistemic way, as the addition of extra information. There is no need, in general, even to restrict such conditioning to events which occur in the past. The term P(A | B) is perfectly meaningful even if the event A occurs before event B. As an example of this, let us consider an experimental arrangement which has a radioactive source encased in such a way that particles can only escape in one direction. A short distance away there is a detector which clicks each time a particle is registered. (To keep things simple, we shall suppose that time is discrete.) The detector is imperfect in that each time a particle arrives it has only a 95% chance of firing. 
Also, at any time when no particle arrives, there is a 1% chance of the detector firing.25 At each time, the chance of a particle being emitted is 0.2. Times t1 and t2 are such that if a particle is emitted by the source at t1 then it necessarily arrives at the detector at t2. Let E1 be the event that a particle is emitted at t1, and let F2 be the event that the detector fires at t2, whether or not a particle was present. One would normally have no qualms about calculating the following probabilities:

P(E1) = 0.2
P(F2 | E1) = 0.95
P(F2 | ¬E1) = 0.01
P(F2) = 0.19 + 0.008 = 0.198
P(E1 | F2) = 0.19/0.198 = 0.96 (approx.).

25 For simplicity, I am assuming that time is discrete. The question is what meaning can be given to the conditional chances. Some26 have taken the "given", or "conditional upon" as primitive, and incapable of further explication, but within the causal theory of chance we can do better. As stated above, P(E1 | F2) for example is defined as Pr(E1 | £x & bcx & F2). There is no reason why one should be restricted to conditioning chances on past events. This is just as well, since P(E1 | F2) seems to be a meaningful term. It is the chance that a particle is present, given that the detector fires. It can even be measured empirically, if there is some way to determine when particles are emitted from the source. One simply looks at the proportion of cases, among the class of detector firings, where particles are emitted the right amount of time before that firing. It may be wondered why we cannot define P(A | B) simply as P(A&B)/P(B). There are two reasons for this. First, and less importantly, this definition would be rather narrow as it would only apply to cases where P(B) exists and is non-zero. The notion of conditional probability is valid in a much wider class of cases than this.27 Secondly, and more importantly, this merely formal definition does not provide any intuitive meaning for P(A | B), and so the definition would be useless. 
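The detector calculation, and the empirical procedure just described (counting emissions among detector firings), can both be checked by simulating the process. A sketch assuming only the chances given in the example; the variable names are mine:

```python
import random

random.seed(1)
emit_p, hit_p, noise_p = 0.2, 0.95, 0.01   # chances from the example
trials = 200_000

fired = emitted_and_fired = 0
for _ in range(trials):
    emitted = random.random() < emit_p                         # event E1
    fires = random.random() < (hit_p if emitted else noise_p)  # event F2
    if fires:
        fired += 1
        emitted_and_fired += emitted

# Relative frequencies estimate P(F2) = 0.198, and P(E1 | F2) ≈ 0.96
# by the proportion of detector firings preceded by an emission.
print(round(fired / trials, 3), round(emitted_and_fired / fired, 2))
```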
One could define many such symbols, such as P(A*B) = P(A∨B)/P(A), P(A#B) = P(A) + P(A&B), and so on, but what for? What usually happens when such formal definitions are provided is that an intuitive meaning is smuggled in by the use of English words like "given", "conditional upon" and so on. This is cheating, as we require a proof that the formal and intuitive meanings are equivalent. 26 See for instance McCurdy (1996). 27 This point is argued in §2.6. 4. Classical Stochastic Mechanics As stated in Chapter 1, I think an account of physical chance should prove its mettle by being useful in stochastic physics. Physicists do not often say that, in order to understand some aspect of a physical theory, one must first grasp von Mises' or Popper's interpretation of probability! An analysis of chance which does at least engage certain problems in physics is surely preferable to one that is independent of them. The remaining chapters are an attempt to show that the causal theory of chance is indeed useful, if not indispensable, in understanding some aspects of physical theory. In this chapter I shall use the causal theory of chance to develop a very general system of mechanics. Since the formalism is fundamentally probabilistic, I call it classical stochastic mechanics (CSM). I use the word classical since, as we shall see later, it is provably inconsistent with quantum mechanics, and thus is only valid in the classical limit, i.e. for "large" systems, in some sense. 4.1 What is CSM Good For? The main purpose of developing CSM in this thesis is to act as a foundation for the later chapters on quantum mechanics. A comparison with relativity theory may be helpful here. In one of Einstein's presentations of his theory of relativity (Einstein, 1922) he begins, perhaps surprisingly, with a chapter laying out Newtonian kinematics and dynamics. Why does he do this, given that Newton's theory is already well understood? 
The point is that he develops the old theory in a different, and much more general, framework. Instead of starting with space and time as separate structures, for example, which effectively rules out the Lorentz transformation of coordinates, he begins with a unified structure of spacetime coordinates. When the Newtonian theory is expressed in this way, the arbitrary assumptions involved become apparent, instead of being hidden. The Galilean transform, for instance, now appears to be rather arbitrary, in acting on only the spatial coordinates. The overall effect of presenting the familiar in a novel way is to broaden the mind, and particularly the imagination, to new possibilities. In my view, to feel at home with quantum mechanics requires a similar broadening of the imagination, the acquisition of new concepts. The easiest way to introduce these, I believe, is to follow Einstein's method, and present the familiar concepts of stochastic physics within a more general framework. We will then be in a position to see the arbitrary assumptions involved, and consider alternatives. The most difficult assumption to drop is an unconscious assumption. The opportunity to develop a general framework for stochastic mechanics is provided by the causal theory of chance. This theory enables one to define some concepts which are usually taken to be primitive, and evaluate some claims which are normally treated as axioms. Apart from the concept of chance itself, one is able to give a detailed analysis of probabilistic independence, and also the notion of a boundary condition. The claims that are provable include: (i) Causal independence implies probabilistic independence. (ii) Traces of an event may succeed, but cannot precede, that event. (iii) Forward, but not backward, transition probabilities are lawlike and time independent. (iv) Reichenbach's Common Cause Principle holds. (v) Entropy tends to increase. 
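Claim (i) can be illustrated, though of course not proved, by a toy computation: when two simulated systems are causally independent (each outcome drawn without reference to the other), their joint outcome frequencies approximately factorise into products of the marginal frequencies. The setup and parameters below are hypothetical:

```python
import random

random.seed(2)
n = 100_000
# Two causally independent two-state systems: each outcome is drawn
# without any reference to the other system's state.
x = [random.random() < 0.3 for _ in range(n)]
y = [random.random() < 0.7 for _ in range(n)]

fx = sum(x) / n                               # marginal frequency for X
fy = sum(y) / n                               # marginal frequency for Y
fxy = sum(a and b for a, b in zip(x, y)) / n  # joint frequency

print(round(fxy, 3), round(fx * fy, 3))  # approximately equal
```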
CSM is of course related to other stochastic approaches to mechanics, and these connections will be spelled out as we go along. The basic difference, however, is that CSM is to be used for theoretical rather than practical purposes. As an analogy, consider the difference between a formal system used to construct proofs in its object language, and one used to prove mathematical results (in the metalanguage) about formal systems. The former will be more complex, containing many rules of inference to facilitate concise reasoning, as well as syntactic rules permitting the omission of brackets, and so on. A formal system used as an object about which theorems are proved, however, will be "stripped down" as far as possible. The number of connectives, quantifiers, rules of inference, etc. will be minimised for the sake of simplicity. 4.2 The Law Function A stochastic system, X, is modelled by a stochastic process {X(t)} whose range is Hx, the class of possible histories for X.1 The proposition £x, which was introduced in §1.3.2, is called the generalised lagrangian of X, and is a maximal description of X's dynamical nature. A possible history is a maximal description of the actual behaviour of the system which is consistent with £x, so that £x determines Hx. Each possible history describes the system in the same set of times Tx.2 Possible histories of a system will be denoted by bold lower-case letters, such as x. Recall from §1.5.4 that the possible history which happens to be the actual history is still an abstract object. It is not identical to the concrete history, but is the best representation of the concrete history. We can now re-state the definition of chance from the previous chapter using the language of stochastic processes. It is that PX(X=x) = Pr(X=x | £x & bcx), where x is a possible history of X and bcx is the boundary condition for X. We will usually write 'Px(x)' in place of 'PX(X=x)' for brevity. 
CSM is derived from five postulates concerning the nature of real systems. The first of these is as follows. 1 In this chapter we are only concerned with chance distributions for random vectors and variables, so there is no need to specify a sample space. The definition of a random variable as a P-measurable function Ω→S is just a mathematical nicety; terms like P(X=x) are unambiguous even if Ω is unspecified. (Think of Ω as the class of all possible worlds, if you will.) 2 Note that I assume a "B-series", or "block universe" view of time, where each time is the present from its own point of view. (See McTaggart (1908).) I can make no sense of the idea that time flows, or that the present moves, or that the present has a special metaphysical status. CSM1 The chance of a history x of X does not depend on the boundary condition bc for X, but only on whether or not x satisfies bc. In other words, if bc1 and bc2 are two possible boundary conditions for X, and x satisfies both of them, then Pr(X=x | £x & bc1) = Pr(X=x | £x & bc2). CSM1 gives rise to a notion of dynamical facility, which is a generalisation of the idea of dynamical possibility that one has in deterministic mechanics. Whereas dynamical possibility is a bivalent quantity, having only the values possible and impossible, the dynamical facility of a history may take any one of a continuous range of values. Dynamical facilities are given by a measure that I call the law function, which is defined as follows. Definition 4.2.1 The law function maps each history x of X to the chance x has if it satisfies the boundary condition. In other words, Lx(x) = Pr(X=x | £x & bc), where bc is any boundary condition which x satisfies. The law function is thus similar to the chance function, but there is an important difference. The law function for X is determined entirely by £x, the generalised lagrangian for X, and is independent of the actual boundary condition. 
Consider, for instance, a history x which is inconsistent with the actual boundary condition bcx. The chance of x is zero, since from bcx we can infer with certainty that x is not the actual history, yet the law value of x will not be zero, in general. We may express the relation between Px(x) and Lx(x) as follows:

Px(x) = Lx(x) if x satisfies the boundary condition, and
Px(x) = 0 otherwise.

It is clear that Lx exists if and only if CSM1 holds, since if different boundary conditions consistent with x give rise to different values of Px(x), then there is no such thing as the chance x has for all boundary conditions it satisfies. Why do we bother to define Lx, however, and why call it the "law" function? The first point to make, in answering these questions, is that for a deterministic system the law function is equivalent to the equation of motion. This is best explained with an example. Consider a classical particle X in one dimension, whose equation of motion has the form

m d2x(t)/dt2 = F(x(t)). (1)

Note that CSM1 is satisfied here, as any two boundary conditions consistent with x give it the same chance (either 0 or 1, depending upon whether x satisfies (1)). This equation defines a law function as follows. If a history x(t) satisfies (1), then Lx(x) = 1; if x does not satisfy (1), then Lx(x) = 0. In general, Lx assigns the value 1 to each dynamically possible history, and 0 to each impossible history. A deterministic equation of motion, of course, is satisfied by a history if and only if it is dynamically possible, so that it contains the same information about the histories as Lx. Equations of motion are considered to be laws, of course. Does a stochastic equation of motion also satisfy CSM1, and define a law function? It does, as will be explained with another example. Consider a particle X(t) in one-dimensional Brownian motion, which is buffeted by a random force F(t), due to collisions with other particles. 
If α is a friction coefficient that depends on the viscosity of the fluid, and Y(t) is the particle velocity, then the particle can be modelled by Langevin's equation:3

dY(t)/dt = -αY(t) + F(t). (2)

3 See Snyder (1975: 186-7). What is the law value of a particular history4 v(t), according to this equation? Let fv(t) be the forcing function which, together with v(t), satisfies equation (2). fv(t), if it exists, is assigned a probability by the stochastic process F(t). This probability, p say, is also the law value of v(t). It is the physical chance of v(t), if it satisfies the boundary condition. Note that the chance of v(t) does not depend on the boundary condition as such, according to equation (2), but only on whether the boundary condition is satisfied. Other stochastic equations of motion, such as the Fokker-Planck equation,5 can also be expressed as a law function. These equations all have the property expressed in CSM1, that the chance of a history does not depend upon the specific boundary condition applied, but only on whether or not the boundary condition is satisfied. Applying a boundary condition to a stochastic equation of motion, one obtains a chance distribution over the class of histories, and so the stochastic differential equation does the same job as the law function. The law function gives far less insight than a differential equation into the dynamics of a system, of course, as the individual terms of an equation can be given individual physical interpretations. Since we do not consider any particular systems in this chapter, however, this is not a handicap. Moreover, the law function has the advantage that we do not have to specify the members of Hx - they need not even be mathematical functions. As stated above, the law function is a measure of the "dynamical facility" of a possible history. If a history has law value 0, for example, then it has no dynamical facility at all - the system simply cannot follow that trajectory. 
A law value of 1, on the other hand, represents perfect dynamical facility. Once that trajectory has been embarked upon, the system will necessarily follow it all the way. In a stochastic system, the dynamical facility of a history will lie somewhere in between these extremes. 4 For simplicity, I am assuming that the law function assigns non-zero measure even to single histories, although in general we must of course deal with Borel classes of histories. 5 See Gardiner (1983:117-174). It is useful to have a measure which is determined by £x alone. Many properties of stochastic systems are independent of the boundary condition. These properties correspond to rather simple arithmetical relations between law values, and more complex ones between chances, as one would expect. We shall see, for example, that the Markov property can be expressed as a very simple constraint on L. In the above discussion I have assumed that the law function is defined on individual histories but in general, like any measure, it is defined on classes of histories. This is not a serious complication, as we can modify Definition 4.2.1 so that Lx(C) = Pr(X∈C | £x & bc), where every history in C satisfies bc. This defines Lx on an algebra of rectangles, and then we use the Carathéodory technique6 to extend Lx to a σ-algebra. To specify the details of this construction would require some specific assumptions about Hx, the class of possible histories. In a theoretical work such detail is surely unnecessary. Since the existence of Lx on a σ-field of subsets of Hx presents no more difficulty than the existence of a normalised measure such as Px, there is no reason to suppose that any problem will arise here. 6 See Breiman (1968:393). 4.3 Relevance and Correlation Before we can consider the dynamics of composite systems, we must understand some basic facts about relevance and correlation. This work is foundational to the major problem investigated in this chapter and the next, namely how to understand correlation with respect to chance. In this section we shall define all the necessary concepts, lay down some postulates and prove some useful theorems. According to the causal theory of chance, relevance and correlation are, at bottom, the same relation. This is counter-intuitive, since 'relevance' is associated with epistemology and 
This work is foundational to the major problem investigated in this chapter and the next, namely how to understand correlation with respect to chance. In this section we shall define all the necessary concepts, lay down some postulates and prove some useful theorems. According to the causal theory of chance, relevance and correlation are, at bottom, the same relation. This is counter-intuitive, since 'relevance' is associated with epistemology and 6See Breiman (1968:393). 136 'correlation' with physics, but will be argued for. Let us begin by defining epistemic relevance in the standard way. 4.3.1 Definitions (i) A is relevant to B within epistemic state K iff PK(B\A) # PK(B). (ii) A is positively (negatively) relevant to B iff PK(B\A) > (<) PK(B). For the first definition, the probabilities are intervals in general, and single values only as a special case. Thus, for example, if PK(B | A) = [0.1, 0.4] and PK(B) = [0.1, 0.3), then A is relevant to B. Intuitively, A is relevant to B within K just in case, for someone with epistemic state K, learning that A is true makes some kind of difference to their attitude to B. 4.3.2 Theorems (i) Relevance is a symmetric relation; (ii) If A, B are mutually irrelevant, then PK(A8cB) = PK(A)PK(B). The proofs of these are trivial. Mutually irrelevant propositions are sometimes described as 'independent'. To avoid confusion, however, I shall reserve the term 'independent' for causal independence, i.e. non-interaction. The term 'correlation', although it is usually reserved for physical contexts, has a purely formal definition. In fact, the definition of correlation is formally identical to the one given above for relevance. The difference is that the term 'correlation' is rarely (if ever) used where the probability function in question is an epistemic probability. For a correlation, the normalised measure must be either physical chance or relative frequency. 
There are thus two types of correlation, which we shall call correlation with respect to chance and frequency correlation. These two are quite distinct, although one may be evidence for the other. Frequency correlations are the kind that can be measured directly, by counting. They hold, of course, not between singular propositions like "Jim is a surgeon" and "Jim is wealthy", but rather between properties, such as being a surgeon and being wealthy. Frequency correlations are often good evidence for a correlation with respect to chance. Suppose there is a random number generator which produces numbers between 1 and 20, inclusive. After (say) a hundred numbers have been picked, we observe that there is a positive frequency correlation between a number thus produced being even, and its being greater than 10. That is, the proportion of even numbers among those greater than 10 is higher than their proportion among numbers of 10 or less. This does not guarantee, but is some evidence for, a correlation with respect to chance for these two properties. The correlation with respect to chance, however, applies at the level of each individual trial rather than the reference class of 100 trials. The frequency correlation does not entail a correlation with respect to chance, since the absence of a correlation with respect to chance is perfectly consistent with there being a frequency correlation - the frequency correlation may be, as one says, due to chance. See how unfortunate that expression is, from my point of view! A frequency correlation which is "due merely to chance" is precisely one that is not associated with any correlation with respect to chance! The expression seems to derive from the ancient notion that some events are caused by a being called "Chance", an entity rather like Fate, Destiny or Providence. Alternatively, it may arise from the idea that some events have causes, and others do not. 
An uncaused event is said to occur by chance, so that one might ask: "Was it caused, or did it just occur by chance?". As I have argued in Chapter One, both of these notions are deeply erroneous and should be eliminated. I will sometimes use the term chance correlation as a synonym for correlation with respect to chance, for convenience. I hope that this will not cause any confusion of the kind discussed in the previous paragraph. When is one proposition relevant to another? We will not answer this question in general, but only consider the case of two propositions which are about separate systems, which we call X and Y. X and Y are separate in the sense that they are entirely disjoint, having no common part, yet they need not be causally isolated from one another. The Earth and the Sun, for instance, are separate systems, yet they interact in various ways. In order to talk of two systems X and Y being correlated, with respect to chance, we need to compare the chance functions for X and for Y with some sort of joint chance function for X and Y together. This joint chance function is most naturally and conveniently identified with the chance function of a third system, namely the composite system which consists of X and Y together (and nothing else). We thus require that, for every pair of systems X and Y, there exists a composite system <X,Y>. The composite system need not have any essential unity or cohesion - it may be an arbitrary collection of spatially-distant systems. It is designated a "system" simply because it may be treated as a system in the formalism, i.e. it has its own lagrangian and boundary condition. Note also that <X,Y> is a concrete object, and thus quite distinct from the class {X,Y}. There is no difference, for example, between <<W,X>,Y> and <W,<X,Y>>. The second postulate of CSM is therefore as follows. CSM2 If X and Y are systems, then so is the collection of X and Y, which we write <X,Y>. 
Thus <X,Y> has its own lagrangian and boundary condition. If we call the collection of all physical systems the cosmos, then it should be noted that CSM2 entails that the cosmos is a physical system. At this stage we deliberately do not specify which objects count as systems in the first place, to keep things as general as possible. The basic postulate concerning relevance is as follows: 4.3.3 Postulate If X and Y are separate systems, and Ax, AY are propositions about X and Y respectively, then Ax, AY are mutually irrelevant within K0.7 7 K0 is the state of maximal ignorance, defined in §2.2.2. As noted in Chapter Two, terms like Pr(A | T) are generally undefined unless either A or ¬A is a necessary truth. That is, there is no single value which is the correct degree of belief in A, determined by rationality alone, but there is always an interval of permitted degrees of belief, even if it is only (0,1). Moreover, according to Definition 4.3.1, relevance is defined in terms of interval probabilities, with single-point probabilities merely as a special case. Ax and AY are mutually irrelevant within K0 just in case Pr(Ax | T) and Pr(Ax | AY) are the same interval. Postulate 4.3.3 entails that, if Ax is relevant to AY within some epistemic state K, then K contains some knowledge of the systems X and Y. The question now before us is: What general character must K have in order for Ax to be relevant to AY? The following theorem helps us to answer this question. 4.3.4 Theorem If K contains knowledge of X only, then Ax is irrelevant to AY within K. Proof Suppose K consists only of the proposition Bx concerning X. Then PK(AY | Ax) = Pr(AY | Ax & Bx). But Ax & Bx is a proposition about X alone, and thus by Postulate 4.3.3 we have that Pr(AY | Ax & Bx) = Pr(AY | T). 
Moreover, by the same reasoning, PK(AY) = Pr(AY | T), and so we have PK(AY | Ax) = PK(AY), which is the result.■ By symmetry it is clear that, for Ax to be relevant to AY within K, K must not merely contain information about Y. We see therefore that K must contain information about both X and Y. This information cannot be in the form of two separate propositions, however, one about X and the other about Y, as the following theorem shows. 4.3.5 Theorem If K consists of Bx and BY, concerning X and Y respectively, then Ax is irrelevant to AY within K. Proof

PK(AY | Ax) = Pr(AY | Ax & (Bx & BY))
            = Pr(AY | (Ax & Bx) & BY)
            = R(AY & BY | Ax & Bx, BY | Ax & Bx)
            = R(AY & BY, BY), since Ax & Bx is a proposition about X alone.

Now, PK(AY) = Pr(AY | Bx & BY)
            = R(AY & BY | Bx, BY | Bx)
            = R(AY & BY, BY).

Thus PK(AY | Ax) = PK(AY), as required.■ 4.3.6 Definition Let B<X,Y> be a proposition about <X,Y>. Then {B<X,Y>}X is the maximal proposition about X that is logically entailed by B<X,Y>. 4.3.7 Definition A proposition B<X,Y> concerning X and Y is said to be factorisable just in case it is equivalent to the conjunction Bx & BY, where Bx = {B<X,Y>}X and BY = {B<X,Y>}Y. The propositions Bx and BY may be considered "factors" of B<X,Y> since truth-functional conjunction is sometimes called the logical product. Indeed, if we represent The True by 1 and The False by 0, then conjunction is multiplication. Note that B<X,Y> is only considered factorisable if each of its factors is only about one system. 4.3.8 Corollary If K consists of B<X,Y> and Ax is relevant to AY within K, then B<X,Y> is not factorisable. Proof Immediate from Theorem 4.3.5.■ What are some examples of non-factorisable propositions about X and Y? Conditionals are perhaps the most obvious kind. The statement "If X=1 then Y=1", for instance, cannot be factorised. The same holds for disjunctions, of course. A more interesting class of examples is where K contains information about causal interaction, such as "X interacts with Y". 
It is clear that this cannot be factorised (it is not equivalent to, for instance, "X interacts with something and Y interacts with something"). If an epistemic state K does not factorise with respect to the two systems X and Y it concerns, then we might say that the knowledge K of the two systems is entangled.8 Alternatively we might say that the knowledge K is non-local, as it irreducibly concerns both systems together, which may be far apart in spacetime. It is important to realise that the converse of Corollary 4.3.8 does not hold, as not every non-factorisable proposition gives rise to relevance. Consider, for instance, the information that X does not interact with Y. This cannot be factorised, yet intuitively at least it does not create an inferential "bridge" between X and Y; i.e. knowing that X and Y do not interact does not enable one to make inferences about Y from statements about X. The presence of non-factorisable knowledge is a necessary condition for relevance, but not a sufficient condition. I shall assume, however, that if the epistemic state K says an X-Y interaction does (or might) occur, then this does provide an inferential bridge between X and Y. 8 Schrödinger (1935:161) uses the term Verschränkung unseres Wissens ("entanglement of our knowledge") to describe this situation. 4.4 Chance in a Composite System Let Z be the composite system <X,Y>. Since Z is a system, it must have a lagrangian £z, which determines a class of histories Hz. The first question to examine, therefore, is how Hz is related 
This assumption that histories of composite systems factorise seems intuitively correct, and also has powerful consequences for the structure of the chance function. We should therefore examine it carefully, to see if it depends upon any substantive assumptions about the system. It clearly is not straightforwardly a matter of logic since, as we have seen, some propositions about composite systems do not factorise in this way. Perhaps the best approach here is to look at why a history z may not factorise into a conjunction x & y, to see whether such a failure is viable. The reason would be that z includes information that is intrinsically relational, such as the proposition "X interacts with Y". This particular example does not factorise, but it is also not the kind of information that may appear in history descriptions such as z. The basic idea is that z describes what Z actually does, its behaviour in some sense. In the case of a pair of simple point-particles, for example, z just describes their trajectories. Thus, while it may be true that the particles interact, this is a fact about the causes of Z's behaviour, and does not describe the behaviour itself. A history z just describes motions, in the broadest sense of the word.

Some relational descriptions do describe behaviour, however, such as "X is longer than Y" and "X and Y are one metre apart". The idea here is that properties such as length and position are irreducibly relational, i.e. we can only determine the length of one body relative to another, and so on. It may be true that properties such as length are ultimately relational, but this does not by itself prevent the above propositions from being factorised. We can, for instance, determine the precise length of X and of Y in relation to some third body, obtaining ratios of 3.2 and 1.7 perhaps. These separate statements about X and Y[9] together entail the required proposition that X is longer than Y. The situation for positions is exactly parallel.
We can describe the position of X and of Y relative to some third reference body, and these two facts will together entail the proposition that X and Y are one metre apart.

[9] If the third system is W, then the statements are really about <W,X> and <W,Y>, of course.

The above method of factorisation, whereby a relation between X and Y is reduced to a relation between properties of X and Y, or perhaps to an X-W and a Y-W relation, may not always be available. Consider, for instance, the example of relative probabilities from §2.5. We found that, for some propositions A and B, such as those which are "infinitely unlikely", we could not reduce the relative probability R(A,B) to the ratio of probabilities Pr(A)/Pr(B), i.e. R(A,Ω)/R(B,Ω). Roughly speaking, the set of real numbers is not sufficiently fine-grained to reduce all relative probabilities to ratios between absolute probabilities. Moreover, this problem is ineliminable, as there is no finer-grained structure (of the appropriate kind) than the real numbers. The real numbers are the best structure there is, but they are still not good enough. The relative probability R(A,B) can, however, be expressed as a ratio such as R(A,C)/R(B,C), where C is another proposition of "infinitesimal probability". In general we might say that R(A,B) exists only if A and B are not too dissimilar in probability. Thus R(A,Ω) does not exist, but R(A,B) does. A and Ω are incommensurable in probability.

Suppose X and Y are similar in a certain respect, so that some relation holds between them, as described by the history z. If we wish to factorise z then this will involve finding another system W which is commensurable with X and Y in the respect in which they are similar. For instance, in order to factorise a statement about the relative length of Y with respect to X, we must find another body with a length that is commensurable with both X and Y. This procedure will run aground, therefore, if there simply is no such third system W that is
commensurable with X and Y in the required way. This may occur perhaps if X and Y have just interacted with each other, leaving special traces or signatures on each other.

In spite of these reservations, we shall include the factorisability of histories among the postulates of CSM. It may not be universally valid, but it is a defining assumption (perhaps even the defining assumption) of classical physics. It is needed to derive the familiar, common-sense independence properties of the chance function.

CSM3 If Z is the composite system <X,Y>, then each history z of Z is equal to some x & y, where x and y are histories of X and Y respectively.

CSM3 gives us an extra theorem about relevance: inferential bridges can be destroyed by adding extra information. Suppose the proposition B_<X,Y> about the history of <X,Y> is not factorisable, and makes X relevant to Y. Then consider the proposition C_X, which is a complete description of X's history. If K is the minimal state including B_<X,Y> & C_X, then is X relevant to Y within K? In fact it is not, as B_<X,Y> & C_X is factorisable, as is shown below.

4.4.1 Lemma If B_<X,Y> is complete concerning X, i.e. {B_<X,Y>}_X is complete, then B_<X,Y> factorises.

Proof: Let B_X = {B_<X,Y>}_X. By hypothesis, B_X is complete, i.e. equivalent to some single history x. B_<X,Y>, on the other hand, is a disjunction of histories z_i, viz:

B_<X,Y> = ⋃_i (x_i & y_i),

where the union symbol denotes truth-functional disjunction. Now, since B_X = x, the x_i in the disjunction must all be equal to x. Thus:

B_<X,Y> = ⋃_i (x & y_i) = x & ⋃_i y_i.

And so B_<X,Y> factorises into B_X & B_Y, as required. ∎

4.4.2 Corollary If C_X is complete, then C_X & B_<X,Y> factorises.

Proof: Since C_X is complete, it follows that {C_X & B_<X,Y>}_X is complete. The result then follows using Lemma 4.4.1. ∎

We should now examine how L_Z is related to L_X and L_Y.
By the definition of L_Z(z) we have

L_Z(z) = Pr(Z=z | ℓ_Z & bc_Z),

where bc_Z is any boundary condition consistent with z. Now, according to CSM3 we may replace Z=z with X=x & Y=y, which yields:

L_Z(z) = Pr(X=x & Y=y | ℓ_Z & bc_Z)
       = Pr(X=x | ℓ_Z & bc_Z) Pr(Y=y | ℓ_Z & bc_Z & X=x)
       = L_Z(x) L_Z(y | x).

To dissect this expression further, we must consider how ℓ_Z is related to ℓ_X and ℓ_Y. We should first note that ℓ_Z does not determine either ℓ_X or ℓ_Y (we might say that ℓ_Z does not "know" what ℓ_X and ℓ_Y are). This is because ℓ_Z does not determine the actual history of either X or Y and, on account of the X-Y interaction, ℓ_X depends on the actual history of Y, and ℓ_Y depends on the actual history of X. Thus ℓ_Z together with X=x determines ℓ_Y, but ℓ_Z alone does not.

Let us now consider the special, and very important, case where X and Y cannot interact. This means that the chance of interaction is zero, so that ℓ_Z & bc_Z together entail that no interaction occurs. In this case, since ℓ_Y does not depend on the actual history of X, ℓ_Z (together with bc_Z) does know ℓ_Y, as well as ℓ_X of course. When X and Y cannot interact, therefore, ℓ_Z & bc_Z can be written as ℓ_X & ℓ_Y & bc_Z; the two propositions are logically equivalent. In short, if X and Y cannot interact (for a given bc_Z) then, within bc_Z, ℓ_Z factorises into ℓ_X & ℓ_Y. Causal independence entails that the joint lagrangian factorises.

Another important question is how bc_Z, the boundary condition for Z, relates to bc_X and bc_Y. In the spirit of CSM3, I shall assume that bc_Z = bc_X & bc_Y, regardless of whether X and Y may interact or not. I shall later argue that bc_Z, bc_X and bc_Y are states, and that states are just very short parts of a history, so this assumption is really a consequence of CSM3.

4.4.3 Theorem If, for a particular bc_Z, systems X and Y cannot interact with each other, then they are uncorrelated with respect to chance, i.e. P_Z(x & y) = P_X(x) P_Y(y).

Proof:
1. P_Z(x & y) = Pr(x & y | ℓ_Z & bc_Z)
2.           = Pr(x & y | ℓ_X & ℓ_Y & bc_X & bc_Y)
3.           = Pr(x | ℓ_X & ℓ_Y & bc_X & bc_Y) Pr(y | x & ℓ_X & ℓ_Y & bc_X & bc_Y)
4.           = Pr(x | ℓ_X & bc_X) Pr(y | ℓ_Y & bc_Y)
5.           = P_X(x) P_Y(y)

Line 1 follows from the definition of P_Z. Line 2 just factorises ℓ_Z & bc_Z. Line 3 applies Axiom 4, the Principle of Conditioning. Line 4 applies Theorem 4.3.5, since x & ℓ_X & bc_X concerns X only, and ℓ_Y & bc_Y concerns Y only. Line 5 follows from line 4, using the definitions of P_X and P_Y. ∎

4.5 Sub-Histories, States and Markov Systems

So far in this chapter we have only considered one kind of event within a system X, namely an entire history for X. Physicists frequently refer to events which are less extensive than whole histories, however, so we shall look at what these "smaller" events may be. It will then be possible to define and discuss the Markov property of stochastic processes.

Consider a system X which exists in a time interval T_X. This interval can be mentally split into sub-intervals, of course. Such a division of T_X automatically splits up x, the actual history of X, into corresponding chunks. For instance, if T_X = [0,10], and we split it into [0,4) and [4,10], then we can define x_1 as the part of x which occurs in [0,4), and x_2 as the part that exists in [4,10]. We note that the motion or change that occurs in x also occurs in x_1 and x_2 together: some of the motion is in x_1, and the rest shows up in x_2. If we partition T_X into finer time slices, such as [0,δ), [δ,2δ), ..., [10−δ,10], where δ > 0, then the amount of change in each sub-interval is reduced, but it still appears that all the change in x shows up in the time slices.

So far I have not introduced the concept of a state of a system, which is one important kind of event, but it seems to be closely connected to the idea of a time slice. Perhaps a possible state of X just is a thin time slice of a possible history of X, such as the piece of x in [nδ, (n+1)δ)?
This does not seem quite right, however, since regardless of how small δ may be, X undergoes some change in this sub-history. States are supposed to be "static", containing no motion at all. To get sub-histories which are also states, therefore, perhaps we can partition T_X into the individual real numbers of which it is composed? The sub-histories of X corresponding to this partition will be very thin indeed, and it does appear that each one contains no change.

Unfortunately, we run into Zeno's paradox of the arrow here. If each of these very thin sub-histories contains no motion at all, then where is all the change in x? Consider, for instance, the flight of an arrow from the bow to the target, a distance of perhaps 100m. The total history contains 100m of motion, and so the collection of sub-histories should contain 100m of motion as well. This is the case for a finite partition into sub-histories, but not for an infinite partition into states. In each state the arrow moves zero metres, and an infinite sum of zeros is still zero. It appears that all the motion occurs "between" the states, but in that case we do not have a genuine partition of T_X, as something in the history x has been lost.

For this reason it seems to me that the concept of a state is something of a mathematical fiction, rather like the Dirac delta function. It should not be taken literally as a real entity, but rather as the fictional limit of a real sequence. In the limit as a (finite) time slice tends to a point, the amount of change may be treated as negligible, and so that time slice can be regarded as a state. This is essentially the same problem that we face in the classical mechanics of a single particle. Although the whole trajectory is captured by the position function r(t), the state at a single time t′ is not fully represented by r(t′), as this gives no information about the momentum of the particle.
To fix the momentum at t′ we require a short (but finite) portion of the trajectory that includes t′. Since this time interval may be arbitrarily short, it makes sense to regard a state as a limiting concept, as in the following definition.

4.5.1 Definition A possible state of a system X at time t is the limit of a sequence of time slices of a possible history of X, which converge to the interval {t}.

We must now consider some general properties of stochastic systems in relation to time. The main such property is known as the Markov condition, which real systems are frequently assumed to possess, at least to a good level of approximation. We shall now examine the metaphysical basis of the Markov condition by attempting to derive it from more fundamental assumptions. The Markov condition may be expressed in various equivalent ways, one of which is as follows:

4.5.2 Definition A system X is Markov iff the distributions for sub-histories in disjoint intervals are conditionally independent, given a sub-history in between them.

One consequence of the Markov property is that if t_0 < t_1 < t_2, and you want to predict the behaviour of a system at t_2, then knowledge of the state of the system at t_1 renders the state at t_0 irrelevant. This is sometimes expressed by saying that the state at t_1 screens off the state at t_0, as far as the state at t_2 is concerned. It seems clear that, if this is true, then the system at t_0 cannot directly interact with the system at t_2, under the general principle that causal interaction implies logical relevance. To make sense of the idea of different time slices of a system interacting with each other, we need to consider the finite time slices of a system as systems themselves. We therefore require the following postulate.

CSM4 Finite time slices of a system are causally-isolated systems. (Each time slice has its own lagrangian and boundary condition, and non-intersecting time slices of a given system cannot interact with each other.)
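The screening-off consequence of Definition 4.5.2 can be illustrated with a toy two-state Markov chain; the transition chances below are arbitrary values of my own choosing, not drawn from CSM itself.

```python
from fractions import Fraction
from itertools import product

# A two-state Markov chain over times t0 < t1 < t2; T[(a, b)] is the
# forward transition chance from state a to state b (illustrative values).
T = {(0, 0): Fraction(2, 3), (0, 1): Fraction(1, 3),
     (1, 0): Fraction(1, 4), (1, 1): Fraction(3, 4)}
start = {0: Fraction(1, 2), 1: Fraction(1, 2)}   # chance of the state at t0

def p(h):
    """Chance of a three-step history h = (s0, s1, s2)."""
    return start[h[0]] * T[(h[0], h[1])] * T[(h[1], h[2])]

H = list(product((0, 1), repeat=3))

def cond(event, given):
    """Conditional chance of `event` given `given`, by summing histories."""
    g = sum(p(h) for h in H if given(h))
    return sum(p(h) for h in H if given(h) and event(h)) / g

# Screening off: given the state at t1, the state at t0 is irrelevant
# to the state at t2.
for s0, s1 in product((0, 1), repeat=2):
    with_s0 = cond(lambda h: h[2] == 1, lambda h: h[0] == s0 and h[1] == s1)
    without = cond(lambda h: h[2] == 1, lambda h: h[1] == s1)
    assert with_s0 == without == T[(s1, 1)]
```

In every case the conditional chance of the state at t2 collapses to the forward transition chance from s1, which is exactly the screening-off behaviour described above.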
If CSM4 is true, then we may be able to derive an expression relating the chance distributions of the whole system to the chance distributions of the time slices, which would be analogous to Theorem 4.4.3. One problem that arises here is the question of boundary conditions for the time slices. We have not yet discussed what form of constraint the boundary condition is in general, and for this reason it is hard to make sense of CSM4. If a time slice is an autonomous system then it should not only have its own boundary condition, but also the same basic kind of boundary condition as the total system. As we shall see, this means that CSM4 is quite restrictive, as it rules out most types of boundary condition.

4.6 Boundary Conditions and Time

In deterministic mechanics the topic of the metaphysical status of boundary conditions is not often discussed. The behaviour of a system is often thought to be produced by the dynamical properties of the system alone, with the boundary condition seen as mere "information", rather than as a second cause.[10] I believe that this point of view results from the fact that, in a deterministic system, the boundary condition is "invisible", in a sense which is explained below. In a stochastic system, on the other hand, the boundary condition is "visible" and so has to be taken seriously as a physical cause.

[10] Recall that, in §3.1, I argued that (at least some) boundary conditions represent physical constraints.

In a deterministic system X there are only two grades of dynamical facility, namely perfect facility and none at all, which correspond to the two law values 1 and 0 respectively. One consequence of this fact is that a given chance function over H_X can be generated by many different boundary conditions. Consider, for instance, the boundary condition X(t_1)=x_1. This fact, together with ℓ_X, entails that the actual history is the unique x ∈ H_X such that x(t_1)=x_1 and L_X(x)=1. But now, any other complete boundary condition consistent with x, such as X(t_2)=x(t_2), will yield that very same history x. Thus, if one only has the chance function P_X for a deterministic system X, then one has no idea which of these possible boundary conditions is the real one: it could be set at any time in T_X. In this sense, the boundary condition of a deterministic system, considered as a cause, is invisible (to the chance function). More precisely, every deterministic system has the following three properties:
Thus, if one only has the chance function Px for a deterministic system X , then one has no idea which of these possible boundary conditions is the real one - it could be set at any time in Tx. In this sense, the boundary condition of a deterministic system, considered as a cause, is invisible (to the chance function). More precisely, every deterministic system has the following three properties: 1 0 Recal l that, in §3.1,1 argued that (at least some) boundary conditions represent physical constraints. 151 (i) The nature of the real boundary condition is empirically inaccessible. (One cannot infer anything about the boundary condition from observation or experiment.) (ii) One cannot explain any observed phenomenon (natural or experimental) by a hypothesis about the boundary condition. (iii) In particular, one cannot explain the arrow of time phenomena by a hypothesis about the boundary condition. To see that (i) is true, consider two rival hypotheses hx and h2 about the boundary condition, where hx says that the boundary condition is bcx and h2 says that it is bc2. If JC is the observed actual history for X , and hx, h2 are consistent with JC, then P x(JC | £x&bcx) -Px(x | lx&.bc2). Then, by Miller's principle, we have that PK(x | hx) = PK(x | h2), so that: PAW = P^x^PM P^lx) P^hJP^h,) = p M Thus the empirical datum JC has no bearing on the relative probability of hx and h2. To see that (ii) holds, recall from Chapter 1 that to explain a phenomenon is to infer it from a hypothesis about its cause. Thus, to explain JC by hx one must infer JC from hx. One problem here is that there are too many rival explanations. For instance, one could explain the same datum JC by the alternative hypothesis h2. Since every boundary condition consistent with JC successfully entails JC, JC itself does not help us to choose between these competing explanations. Another problem is that even the most bizarre counterfactual data could be 152 explained just as easily as the actual data. 
As long as a history is dynamically possible, it will be actual provided it satisfies the boundary condition. To explain phenomena within a deterministic system by a hypothesis about the boundary condition is thus a very weak explanation. The choice of explanatory hypothesis is arbitrary, and also by this method we could explain just about anything. Such explanatory power is not considered a theoretical virtue! It may be possible to judge some hypotheses to be a priori more (epistemically) probable than others but, since the boundary condition is an external constraint on the cosmos, this means that the explanation is more metaphysical than physical. While I do not wish to rule such explanations out as generally invalid, I take it that a physical explanation, if available, is preferable to a metaphysical one.

Claim (iii) concerns the famous "arrows of time", which are discussed in §4.7. They are temporally-asymmetric physical phenomena that are awkward to explain, in light of the fact that the laws of physics seem to be time symmetric. Claim (iii) follows from (ii), of course, but some more particular remarks may be made. It is orthodox among physicists to attribute temporal asymmetries to special initial conditions, as by the following authors:[11]

[11] Sklar (1986), Lebowitz (1993, 36), Feynman (1967, ch. 5).

While matters are by no means universally agreed upon, the most plausible view at the present time seems to be that in order to get a reasonable picture of the entropic increase accompanying expansion of our current phase of (at least the 'local') universe, we must impose a low entropy initial condition on the big-bang singularity.

...we are led more or less inevitably to cosmological considerations of an initial "state of the universe" having a very small Boltzmann entropy. That is, the universe is pictured to be born in an
initial macrostate M_0 for which [the phase space volume] is a very small fraction of the "total available" phase space volume.

...it is necessary to add to the physical laws the hypothesis that in the past the universe was more ordered, in the technical sense, than it is today ... to make an understanding of the irreversibility.

For such an explanation to work one must assert that some hypotheses about the boundary condition have much higher epistemic probabilities than others, which is a metaphysical rather than a physical statement. If this were the only explanation on offer then we would perhaps be justified in accepting it. I will argue, however, that a physical explanation is available, within stochastic physics.

To sum up, it seems that physics requires the concept of probability. If one accepts indeterminism then these probabilities exist within the cosmos, so to speak, and are thus squarely within the domain of physics. If one insists on determinism, on the other hand, then the probabilities are forced out of the physical cosmos itself and into the Great Beyond. For this reason, among others, I believe that the cosmos is genuinely stochastic.

For a stochastic system, none of (i)-(iii) holds, as we shall now see. Consider, for instance, the two boundary conditions X(t_1)=x and X(t_2)=x for the stochastic system X. Do they give rise to the same chance function P_X? They do not, as P_X(X(t_1)=x) = 1 in the first case, but in the second case it will have some value less than one. If t_1 and t_2 are widely separated in time, then P_X(X(t_1)=x) may be quite low for the boundary condition X(t_2)=x. Thus, if X is stochastic then P_X depends strongly on the time at which the boundary condition is set. Now, as shown in Chapter 3, P_X is empirically accessible by measuring frequencies, so it follows that facts about the boundary condition are also empirically accessible, at least to some extent.
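Both halves of this contrast can be checked in a minimal computational sketch, using made-up numbers of my own: in the deterministic case equal likelihoods leave the posterior ratio of two boundary-condition hypotheses unchanged, while for a two-step random walk the chance function visibly depends on whether the boundary condition is set at t=0 or at t=2.

```python
from fractions import Fraction
from itertools import product

# Deterministic case: any boundary condition consistent with the observed
# history x gives it chance 1, so the likelihoods of rival hypotheses h1
# and h2 are equal, and the posterior ratio equals the prior ratio.
prior = {'h1': Fraction(1, 3), 'h2': Fraction(2, 3)}      # arbitrary priors
likelihood = {'h1': Fraction(1), 'h2': Fraction(1)}       # P_K(x | h_i) = 1
evidence = sum(prior[h] * likelihood[h] for h in prior)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
assert posterior['h1'] / posterior['h2'] == prior['h1'] / prior['h2']

# Stochastic case: a symmetric +/-1 random walk over times 0, 1, 2.
def walk_from(x0):
    return [(x0, x0 + a, x0 + a + b) for a, b in product((-1, 1), repeat=2)]

# Boundary condition X(0)=0: four equally likely histories.
P0 = {h: Fraction(1, 4) for h in walk_from(0)}
assert sum(q for h, q in P0.items() if h[0] == 0) == 1
assert sum(q for h, q in P0.items() if h[2] == 0) == Fraction(1, 2)

# Boundary condition X(2)=0: renormalise over histories ending at 0.
H2 = [h for x0 in (-2, 0, 2) for h in walk_from(x0) if h[2] == 0]
P2 = {h: Fraction(1, len(H2)) for h in H2}
assert sum(q for h, q in P2.items() if h[2] == 0) == 1
assert sum(q for h, q in P2.items() if h[0] == 0) == Fraction(1, 2)

# The two chance functions are visibly different, so the time of the
# boundary condition is empirically accessible in the stochastic case.
assert P0.keys() != P2.keys()
```

With the boundary condition at t=0, X(0)=0 has chance 1 and X(2)=0 has chance 1/2; with the boundary condition at t=2 the situation is reversed, so frequency data can distinguish the two.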
Indeed, in this section I will argue that from certain data we can infer that the boundary condition for the cosmos is set at the Big Bang.

We should also note that, for a stochastic system, a boundary condition that physically constrains the actual history at a single point in time introduces a temporal asymmetry, and a local direction of time. (One temporal direction points towards the time of the boundary condition, and the other points away from it.) This is particularly clear in the case where the boundary condition is set at one extremum of the time interval T_X, where the orientation of time is the same throughout the interval.

The final postulate of CSM is that, in actual physical systems, the boundary condition is set at a single point in time.[12] This is not just true for the cosmos as a whole, I claim, but also for the subsystems within it, including time slices of systems. I shall argue for this postulate by consideration of theoretical elegance, in this section, and on empirical grounds, by showing that it has the right physical consequences, in the next section.

[12] Or something similar, such as a surface of spacelike-separated points.

CSM5 The boundary condition of a system X specifies the state of X at one instant of time, i.e. bc_X is of the form X(t_1)=x_1.

The first argument for CSM5 is that it allows CSM4 to be true, as will now be shown. The following notion of a connected partition will be useful here.

4.6.1 Definition A connected partition of T_X is a class of closed sub-intervals of T_X whose union is T_X and whose pairwise intersection is either empty or a singleton set.

Thus, for instance, if T_X is [0,10], then one connected partition of T_X is {[0,3], [3,4], [4,10]}. Consider this connected partition of a stochastic system X. If the boundary condition is at a single point of time, then it constrains only one time slice of the partition (unless it happens
to hit the point of intersection of two adjacent intervals, a case which we consider below). For example, let us suppose that the boundary condition is set at t=3.1. The time slice [3,4] then has a lagrangian and a boundary condition, which are sufficient to cause its actual history. Thus the actual history of this time slice comes into being. Now, since the history in [3,4] is actual, the states at t=3 and t=4 are actual as well, since they are sub-histories within this interval. The state at t=3, however, is also part of the interval [0,3], so the interval [0,3] has one of its states constrained by the actual history in [3,4]. This constraint on [0,3] is a boundary condition for it, so it has a sufficient cause for its own actual history, which also comes into being. In a similar way, the state of [3,4] at t=4 provides a boundary condition for the time slice in [4,10]. If the boundary condition falls on the intersection of two time slices, such as at t=3, then it causes histories for both time slices [0,3] and [3,4]. The actual history in [3,4] then provides a boundary condition for [4,10] as before.

We see then that the time slice which has the boundary condition for the total system X provides a (single-time) boundary condition for each of its neighbours. These, in turn, by acquiring actual histories, provide boundary conditions for their adjacent time slices, and so on. Thus each time slice receives its own single-time boundary condition, and hence CSM4 is satisfied. It is hard to conceive of any other kind of boundary condition which will yield this same result. For a particular connected partition of X we can no doubt cook up such a boundary condition, but this is uninteresting since any partition of X is arbitrary, and so if CSM4 is valid it should hold for all such partitions.
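The bookkeeping of the last two paragraphs can be sketched in code. This is my own illustration: closed intervals are represented as (lo, hi) pairs, and the second function reproduces the order in which the slices of the partition acquire boundary conditions.

```python
# Definition 4.6.1: the union of the (sorted) closed intervals is the whole
# interval, and adjacent intervals overlap in exactly one point.
def is_connected_partition(parts, T):
    parts = sorted(parts)
    return (parts[0][0] == T[0] and parts[-1][1] == T[1]
            and all(a[1] == b[0] for a, b in zip(parts, parts[1:])))

parts = [(0, 3), (3, 4), (4, 10)]
assert is_connected_partition(parts, (0, 10))
assert not is_connected_partition([(0, 3), (4, 10)], (0, 10))  # gap in (3,4)

def acquisition_order(parts, t_bc):
    """Order in which slices acquire a boundary condition: the slice
    containing t_bc comes into being first, then the condition spreads
    outwards to the neighbours on each side.  (If t_bc lies on an
    intersection, we simply start from the earlier slice.)"""
    parts = sorted(parts)
    start = next(i for i, (lo, hi) in enumerate(parts) if lo <= t_bc <= hi)
    order, lo, hi = [start], start - 1, start + 1
    while lo >= 0 or hi < len(parts):
        if lo >= 0:
            order.append(lo)
            lo -= 1
        if hi < len(parts):
            order.append(hi)
            hi += 1
    return [parts[i] for i in order]

# Boundary condition at t=3.1: the slice [3,4] acquires a history first,
# then [0,3] and [4,10] receive their own single-time boundary conditions.
assert acquisition_order(parts, 3.1) == [(3, 4), (0, 3), (4, 10)]
```

Because the propagation works outwards from whichever slice contains the boundary condition, the same sketch applies to any connected partition, which is the point of the argument above.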
The second argument for CSM5 is that it (together with the other postulates) enables one to prove the Markov property, which (intuitively) ought to follow from CSM3 and CSM4 together. Moreover, closed systems are usually assumed to be Markov, since in that case the system cannot have an external "memory". Before we can show this, however, we need some preliminary results.

4.6.2 Theorem The law function is point normalised, i.e. if C is the subset of H_X containing all the histories consistent with a boundary condition X(t′)=x (for fixed t′ and x), then L_X(C) = 1.

Proof: L_X(C) = Pr(C | ℓ_X & X(t′)=x), by Definition 4.2.1, since each member of C satisfies the boundary condition X(t′)=x. But, since C is the class of all histories in H_X satisfying this condition, Pr(C | ℓ_X & X(t′)=x) = 1. ∎

4.6.3 Definition Suppose {T_i} = {T_1, T_2, ..., T_n} is a connected partition of T_X. Then possible histories in T_i will be denoted a_i, b_i, etc., and variables over these x_i. The lagrangian for T_i is denoted ℓ_i, and the corresponding law function is L_i.

The main result that is used to prove the Markov property is a product rule for time slices, analogous to Theorem 4.4.3, as follows.

4.6.4 Theorem Consider a connected partition {T_i} of T_X which splits the history a into connected sub-histories {a_i}. Then L_X(a) = L_1(a_1)L_2(a_2)...L_n(a_n).

Proof: We only need consider the case n=2, as the other cases then follow by iteration. Let t_1 be the single member of T_1 ∩ T_2, so that a_1(t_1) = a_2(t_1). The relation L_X(a) = L_1(a_1)L_2(a_2) is true just in case the relation P_X(a) = P_1(a_1)P_2(a_2) is true when a satisfies bc_X, a_1 satisfies bc_1 and a_2 satisfies bc_2. Let us therefore assume that a, a_1 and a_2 all satisfy their respective boundary conditions.
We further assume, without loss of generality, that bc_X falls within T_1, so that bc_X constrains x_1 only, and thus is equivalent to bc_1. We then have:

P_X(a) = P_X(a_1 & a_2)
       = Pr(a_1 & a_2 | ℓ_X & bc_X)
       = Pr(a_1 & a_2 | ℓ_1 & ℓ_2 & bc_1)
       = Pr(a_1 | ℓ_1 & ℓ_2 & bc_1) Pr(a_2 | a_1 & ℓ_1 & ℓ_2 & bc_1)
       = Pr(a_1 | ℓ_1 & bc_1) Pr(a_2 | a_1(t_1) & ℓ_2)
       = P_1(a_1) P_2(a_2)

(Note that this proof is exactly parallel to the proof of Theorem 4.4.3, and also relies on CSM3. It is necessary to assume, for instance, that a = a_1 & a_2.) Since the assumption that a, a_1 and a_2 all satisfy their respective boundary conditions leads to the conclusion that P_X(a) = P_1(a_1)P_2(a_2), it follows that L_X(a) = L_1(a_1)L_2(a_2). ∎

The intuitive idea of the law function is a measure of dynamical facility, of how easily the system follows the trajectory in question. Now, the dynamical facility of a sub-history such as a_2 should depend only upon the lagrangian ℓ_2, so it should be the case that L_X(a_2) = L_2(a_2). We shall now prove this.

4.6.5 Theorem L_X(a_i) = L_i(a_i).

Proof: Let us consider the connected partition {T_l, T_i, T_u} of T_X, where 'l' and 'u' should be read 'lower' and 'upper'. We then have that:

L_X(a_i) = Σ_{x_l, x_u} L_X(x_l & a_i & x_u),

where the variables x_l and x_u range over all histories in T_l, T_u respectively which connect to a_i. Using Theorem 4.6.4 it follows that

L_X(a_i) = Σ_{x_l, x_u} L_l(x_l) L_i(a_i) L_u(x_u)
         = L_i(a_i) Σ_{x_l} L_l(x_l) Σ_{x_u} L_u(x_u)
         = L_i(a_i).

The last line follows using point normalisation (Theorem 4.6.2), since the variables x_l and x_u are each constrained at a single instant of time. ∎

4.6.6 Corollaries (i) If bc_X is set in T_i, and consistent with a_i, then P_X(a_i) = L_i(a_i). (ii) If bc_X is set in T_l, then P_X(a_i) = L_i(a_i) Σ_{x_l} L_l(x_l), where the x_l range over histories in T_l that satisfy bc_X and connect to a_i.

We are finally in a position to prove the Markov condition.

4.6.6 Theorem Consider sub-histories a_1, a_2, a_3 in time intervals T_1, T_2, T_3 respectively, where T_1 < T_2 < T_3. (These intervals may or may not form a connected partition of T_X.) Then P_X(a_1 & a_3 | a_2) = P_X(a_1 | a_2)P_X(a_3 | a_2). This is easily proved from the following lemma.
4.6.7 Lemma Consider the connected partition {T_l, T_2, T_u} of T_X, so that T_1 ⊆ T_l and T_3 ⊆ T_u. Then, for all sub-histories x_l, x_u in T_l, T_u, P_X(x_l & x_u | a_2) = P_X(x_l | a_2)P_X(x_u | a_2).

Proof: From Axiom 4 we have:

P_X(x_l & x_u | a_2) = P_X(x_l & a_2 & x_u) / P_X(a_2).

Case (i): Suppose that bc_X is set somewhere in T_l. Then, for every x_l that satisfies bc_X,

P_X(x_l & a_2 & x_u) = L_l(x_l) L_2(a_2) L_u(x_u) / Σ_{x_l′} L_l(x_l′).

Note that x_l′ is constrained to agree with a_2 at their point of intersection, and also to satisfy bc_X, so the denominator is less than one. By similar reasoning we find the corresponding expressions for P_X(x_l & a_2), P_X(a_2 & x_u) and P_X(a_2), and hence P_X(x_l & x_u | a_2) = P_X(x_l | a_2)P_X(x_u | a_2).

Case (ii): bc_X is set somewhere in T_2. It is trivial to show that P_X(x_l & x_u | a_2) = P_X(x_l | a_2)P_X(x_u | a_2).

Case (iii): bc_X is set somewhere in T_u. From case (i), P_X(x_l & x_u | a_2) = P_X(x_l | a_2)P_X(x_u | a_2) follows by symmetry.

Thus we infer that P_X(x_l & x_u | a_2) = P_X(x_l | a_2)P_X(x_u | a_2), as required. ∎

Proof of Theorem 4.6.6:

P_X(a_1 & a_3 | a_2) = Σ_{x_l} Σ_{x_u} P_X(x_l & x_u | a_2)
                    = Σ_{x_l} Σ_{x_u} P_X(x_l | a_2) P_X(x_u | a_2)
                    = P_X(a_1 | a_2) P_X(a_3 | a_2).

This is the required result. ∎

The chance of an event does not vary with time. Yet, since chance is relativised to a system, and time slices are systems, we can easily define a time-dependent chance, which I call up-to-date (utd) chance. The basic idea is that the chance at time t of some later event E is the chance of E within the time slice system in [t, τ], where τ is the upper bound of T_X. Note that this depends on the actual history before t. It is not predictable in advance of t what the chance of E at t will be.

4.6.8 Definition Consider a sub-history x_i in T_i, and a time t < T_i, so that T_i ⊆ [t, τ]. Then the chance at time t of x_i, written P_t(x_i), is the chance of x_i within the time slice [t, τ].

As one might expect, utd chances are related to conditional chances, as the following results show. The interval T_r is [t, t_i], where t_i is the lower bound of T_i.

4.6.9 Lemma P_t(x_i) = L_i(x_i)
Σ_{x_r} L_r(x_r), where the x_r are all consistent with x_i and X(t).

Proof:

P_t(x_i) = Σ_{x_r} L(x_r & x_i)
         = Σ_{x_r} L_r(x_r) L_i(x_i)
         = L_i(x_i) Σ_{x_r} L_r(x_r),

using Theorem 4.6.4. ∎

4.6.10 Theorem If X[0,t] specifies the actual history of X in [0,t], then P_t(x_i) = P_X(x_i | X[0,t]).

Proof: Let the actual history in [0,t] be a_0. Then

P_X(x_i | a_0) = P_X(a_0 & x_i) / P_X(a_0)
             = Σ_{x_r} P_X(a_0 & x_r & x_i) / P_X(a_0)
             = Σ_{x_r} L_0(a_0) L_r(x_r) L_i(x_i) / L_0(a_0)
             = L_i(x_i) Σ_{x_r} L_r(x_r),

where L_0 is the law function for the slice [0,t]. Using Lemma 4.6.9, it then follows that P_t(x_i) = P_X(x_i | X[0,t]), as required. ∎

4.6.11 Corollary P_t(x_i) = P_X(x_i | X(t)).

Proof: Immediate from Theorems 4.6.10 and 4.6.6. ∎

Note that, since x_i may be a state, a utd chance may be a forward transition probability.

4.7 The Arrow of Time

In the previous section we saw that postulate CSM5, which says that the physical boundary condition of every system is a constraint at a single point of time, has two theoretical virtues: it allows CSM4 to be true, and together with CSM4 entails the Markov relation. In this section I shall argue that a single-instant boundary condition also provides a suitable arrow of time.

What are the phenomena that constitute the "arrow(s) of time"? The three such phenomena considered in this chapter are, as mentioned in §4.1:

(i) Traces of an event (such as a crater, with respect to a meteor impact) may only succeed, not precede, that event.

(ii) Forward, but not backward, transition probabilities are lawlike and time-independent.

(iii) Entropy tends to increase.

Other phenomena are sometimes discussed as arrows of time, such as the expansion of the cosmos and the outward propagation of radiation, but these three are sufficient to demonstrate how CSM can be used to investigate the arrow of time. The puzzle created by these phenomena lies in the fact that the laws of physics are time symmetric. If, therefore, one explains phenomena only by these laws, then one is in the difficult position of explaining an asymmetric effect by appeal to a symmetric cause.
Of course, if the cosmos is stochastic, then the actual history may turn out to be time asymmetric just by chance. However, the fact that there are several distinct arrows of time, which are all consistent with one another, strongly suggests that the actual history has some cause that is not time symmetric.[13] In my view there is such a time-asymmetric cause, which is the boundary condition. The boundary condition is not time asymmetric by virtue of the state that it fixes, but rather from being a constraint on the cosmos at one instant of time. Any such boundary condition, regardless of the state involved, will cause phenomena (i) and (ii). Moreover, if the fixed state is also one of low entropy, then (iii) is brought about as well.

[13] For a fuller discussion of this problem, see Price (1996: 16-21), Sklar (1986) or Davies (1974).

4.7.1 Time's Arrow and Causation

In the previous section it was shown how, if the boundary condition constrains the actual history at just one instant, one time slice may provide a boundary condition for its neighbour. In the special case where the boundary condition is set at either extreme of T_X, this means that the time slices of X are well ordered by the cause-effect relation, as will now be shown.

Suppose T_X = [0,10] and bc_X is set at t=10 (it would work just as well if set at t=0). Also, let us consider the arbitrary connected partition {[0,1], [1,3], [3,8], [8,10]} of T_X. Since the time slice in [8,10] has a boundary condition, it has a sufficient cause for its actual history, so an actual history for [8,10] comes into being. This actual history fixes the state of X at t=8, which is part of the interval [3,8], so this time slice then acquires a boundary condition. Having a boundary condition, its actual history is also caused to exist, which in turn sets a boundary condition for [1,3], and so on.
The actual history in [8,10] is brought about by the boundary condition, and this history causes the history in [3,8], which causes the history in [1,3], which causes the actual history in [0,1]. There is nothing special about the partition chosen; the same causal ordering would hold for any such partition. Also, I considered a boundary condition at t=10 rather than t=0, making causation run "backwards", to show that the causal order does not depend on the <-ordering of the real numbers. Note that, on this approach, the coincidence of temporal order with causal order is not analytic but merely a physical fact. If the boundary condition is set at some time which is not an extremum of T_X, for example, then the cause-effect relation does not provide even a linear ordering of the time slices. Moreover, if the boundary condition constrained the actual history at two or more different times, then presumably an even stranger situation would obtain.

In the above example there is no mention of any causal ordering within a single time slice, such as [3,8]. Since CSM4 allows any partition into time slices, however, we can consider the time slice in [3,8] to be composed of two subsystems, one in [3,5] and the other in [5,8] perhaps. From this perspective we see that the actual history in [5,8] causes the history in [3,5], and so a little more of the causal order is revealed. Since this causal structure will appear for any connected partition of T_X, no matter how fine, we see that there is in fact a continuous causal flow of events (in a closed system) rather than a discrete causal chain.

The above talk of a causal order of becoming and a continuous causal flow may be taken to suggest that I have abandoned my earlier commitment to a "B-series" view of time. This is not the case. The recognition of a causal structure in spacetime does not require one to believe in a moving present.
I do hope, however, that this recognition may weaken the common intuition that the B-series view leaves out something important about time which distinguishes it from a mere physical dimension. Causation does indeed relate to time and space quite differently, and there really is something which may be called a flow of potentiality into actuality. Perhaps the so-called "flow" of time is really just the causal ordering?

In this chapter I am concerned with time rather than with space, so the three spatial dimensions have been ignored altogether. If they were included then we would have to deal with spacetime slices rather than time slices, and each spacetime slice would be a cause of the other slices in its future light cone. Thus, although I am currently ignoring space, it seems that my account is capable of being extended to cover a four-dimensional cosmos.

This idea of a causal arrow of time which arises from a single-instant boundary condition is not often taken seriously, even though it fits very well with our pre-theoretical intuitions about time. Why is this? I think there are two main reasons. First, to explain the arrows of time by means of a physical constraint on the whole cosmos seems like resorting to a deus ex machina. One would prefer to explain all phenomena by means of the intrinsic properties of the cosmos itself, with no interference from "outside". Second, the very simplicity of the hypothesis, and its conformity with common sense, can be seen as weaknesses. It may be considered merely a "folk" explanation of the phenomenon, involving pre-scientific notions. A scientific explanation would be more technical and sophisticated, and would make no fundamental use of the antiquated idea of causation.

With regard to the first objection, I agree that a "self-contained" explanation of the arrow of time would be preferable, in accordance with Ockham's Razor.
However, the very nature of dynamical systems strongly suggests that they do not contain within themselves the entire cause of their own motions. The intrinsic nature of a system includes what we may call laws of development, governing transitions from one state to another, but it does not (we generally think) impose any lawlike constraint on the initial state, or any other state. The appeal to an external constraint is therefore unavoidable.

As for the second objection, we have seen that the old idea of cause and effect is needed to give a theory of physical chance. Attempts to eliminate the notion of causation (or to reduce it to something less mysterious) have failed, and there is little prospect of success in the future. If there really is a causal relation then it must be deeply involved in the arrow of time. Since the causal relation itself cannot be reduced to anything more basic, it is unlikely that its role in this phenomenon can be expressed in other, more modern, terms.

4.7.2 Temporal Symmetries of the Lagrangian

Physical laws are generally thought to possess a high degree of temporal symmetry. In CSM the physical laws are represented by ℓ, the lagrangian, so we might expect ℓ to display some temporal symmetry as well. So far we have only postulated one such kind of symmetry for ℓ, but it is very easy to impose two further kinds. There are two reasons to postulate that ℓ has these temporal symmetries: First, there is evidence that real systems do have the symmetries in question. Second, it highlights the fact that the temporal asymmetries of P are due entirely to CSM5, that the boundary condition constrains the state at a single time, and not to any temporal asymmetry of ℓ.

The two kinds of time symmetry in ℓ are expressed in terms of the law function L. They are usually called time-translation invariance and time-reversal invariance. The basic idea of time-translation invariance is that, for a given system X, ℓ_X does not vary with time.
Thus, if ℓ_1 and ℓ_2 are the lagrangians of equal-duration time slices of the same system, then ℓ_1 = ℓ_2. In terms of the law function we have the following definition.

Definition The lagrangian ℓ_X is time-translation invariant (tti) just in case, for any two equal-duration time slices of X, such as X_1 and X_2, and any possible history a of X_1, L_1(a) = L_2(a).

The basic idea of time-reversal invariance is that the lagrangian ℓ does not distinguish between the two directions of time. To make this notion more precise we require the concept of a time-reflection, or time-reverse, of a possible history. If we imagine that a possible history is some sort of geometrical object, then it can be reflected in a "simultaneity plane". The image aʳ under reflection of a history a will be different from a, and yet the two will be congruent. Intuitively, aʳ is just a going backwards.

Definition The lagrangian ℓ_X is time-reversal invariant (tri) just in case, for any possible history a of X, L_X(a) = L_X(aʳ).

The property of tti entails that forward transition chances are independent of time, i.e.

P_X(X(t_1+d)=b | X(t_1)=a) = P_X(X(t_2+d)=b | X(t_2)=a).

The property of tri entails such relations as

P_X(X(t_2)=b | X(t_1)=a) = P_X(X(t_2)=aʳ | X(t_1)=bʳ).
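These two invariances are easy to exhibit concretely. The sketch below is my own illustration, using a simple crossing-count law function of the kind introduced for the urn model later in the chapter: the law value of a discrete history is a product of one-step weights w(a, b). Because w is symmetric and does not depend on the time index, the law function is both tti (trivially) and tri:

```python
from itertools import product

# Illustrative law function (my own example, not from the text): the law
# value of a history is a product of symmetric one-step weights. Each
# state is taken to be its own time-reverse, so time-reversing a history
# just reverses the sequence.
p = 0.1

def w(a, b):
    return p if a != b else 1 - p   # symmetric: w(a, b) == w(b, a)

def law(hist):
    value = 1.0
    for a, b in zip(hist, hist[1:]):
        value *= w(a, b)            # weight does not depend on t: tti
    return value

for hist in product([0, 1], repeat=4):
    reverse = hist[::-1]            # the time-reverse a^r of the history a
    assert abs(law(hist) - law(reverse)) < 1e-15   # tri: L(a) == L(a^r)
print("time-reversal invariance holds for all 16 histories")
```

The point of the check is that reversing a history leaves the multiset of one-step transitions unchanged, so any symmetric, time-independent weight yields a tri and tti law function.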
(iii) Diffusion occurs toward the future, but not toward the past. (iv) Reichenbach's Common Cause Principle holds, i.e. there exist conjunctive forks open to the future, but not open to the past. Perhaps a better way of saying (i) is that the forward transition chances depend upon the lagrangian alone, whereas the backward transition chances depend upon the boundary condition as well as the lagrangian. This is why the backward transition chances are time dependent; they vary with the proximity of the boundary condition, and this proximity depends on the time. It also accords with the common view that the forward transition chances are lawlike and "physical", whereas the backward ones are suspect. We have seen, from Theorem 4.6.10 and Corollary 4.6.11, that: Px(xI\X(t)) = L!(xL)JjLr(xr). 1 4 T h i s is in order to make the temporal direction a global, rather than merely local, one. 1 5 T h i s matter is discussed in Arntzenius (1995). 168 Now, if we contract the sub-history JC, to a single state b, at time t+d say, then L((x()=l. We thus infer that Px(X(t + d)=blX{t) = A) = X LMr), xr where the histories jc r in [t, t+d] are all such that xr(t) = a and xr(t+d) = b. Intuitively, the forward transition chance is just the law value of the class of histories which contain the transition in question from a to b. Since we have assumed that the law function is time-translation invariant, it immediately follows that the forward transition chances are independent of time. Theorem The forward transition chances are independent of time, i.e. for any times ty, t2 e Tx, Px(X{tx+d)=b | X(f,)=a) = Px(X(t2+d)=b | X(t2)=a). Proof. Consider the two sub-intervals of Tx, Tl=[tl, tx+d] and T2=[t2, t2+d], which are of equal duration, and a sub-history JC in [tu tx+d] where x(tx) = a and x(t{+d) = b. By Thm. 4.6.10 we have that P x(JC | X(r,)=a) = L[(JC). Similarly, if JC' is the time translation of JC into T2, we have that P x(JC' | X(t2)=a) = L2{x'). 
Thus, since P_X(X(t_1+d)=b | X(t_1)=a) and P_X(X(t_2+d)=b | X(t_2)=a) are just sums of terms like P_X(x | X(t_1)=a) and P_X(x' | X(t_2)=a), and by tti L_1(x) = L_2(x'), it follows that P_X(X(t_1+d)=b | X(t_1)=a) = P_X(X(t_2+d)=b | X(t_2)=a). ∎

Theorem The backward transition chances depend on the boundary condition, and are not independent of time.

Proof: Let us consider the backward conditional chance P_X(a_1 | a_2), where a_1 is in [t, t+d] and a_2 is in [t+d, t+d+d']. We have that

P_X(a_1 | a_2) = P_X(a_1 & a_2) / P_X(a_2) = Σ_{x_0} L_0(x_0) L_1(a_1) L_2(a_2) / Σ_{x_0, x_1} L_0(x_0) L_1(x_1) L_2(a_2).

If the temporal order of a_1 and a_2 were reversed, then P_X(a_1 | a_2) would just be L_1(a_1). Thus we see that the backward transition probability has two additional terms, one in the numerator and one in the denominator. Both of these depend on the boundary condition, and thus on the time. ∎

Now let us prove property (ii). Consider the chance function for a system X which suffers outside interference within some time interval T_t. The standard example of this is where X is a patch of sand, and the interference consists of a foot stepping on it. Observations indicate that after the footstep there is a foot-shaped impression in the sand which did not exist before. The difficulty is to explain the time-asymmetry of this phenomenon. We represent the fact that the system X is disturbed during T_t in the usual way, by modifying ℓ_t and thus L_t.

Theorem (a) The chance function before T_t is independent of the interaction in T_t. (b) The chance function after T_t is dependent on the interaction in T_t.

Proof: (a) Consider a history a_1 in T_1 < T_t. Let T_0 be the closed interval from 0 to T_1, and T_2 be the closed interval from T_1 to T_t, as in Figure 4.1. Then

P_X(a_1) = Σ_{x_0, x_2, x_t, x_3} P_X(x_0 & a_1 & x_2 & x_t & x_3) = L_1(a_1) Σ_{x_0} L_0(x_0) Σ_{x_2} L_2(x_2) Σ_{x_t} L_t(x_t) Σ_{x_3} L_3(x_3).

Now, the variable x_0 is constrained by the boundary condition, and also to fit with a_1 at the intersection of T_0 with T_1.
The variable x_2 is constrained at the lower end only, to fit with a_1 at T_1 ∩ T_2. In a similar way, x_t and x_3 are constrained only at the lower end. Thus, by Theorem 4.6.2, Σ_{x_3} L_3(x_3) = 1, and this sum may be cancelled from the summation. The same applies to all the other sums whose variables have only one constraint. This leaves us with

P_X(a_1) = L_1(a_1) Σ_{x_0} L_0(x_0),

which is clearly independent of L_t. ∎

[Figure 4.1]

(b) Consider a history a_2 in T_2 > T_t. This time let T_0 be the entire closed interval from 0 to T_t, and let T_1 be the closed interval from T_t to T_2, as in Figure 4.2. Then

P_X(a_2) = Σ_{x_0, x_t, x_1, x_3} L_0(x_0) L_t(x_t) L_1(x_1) L_2(a_2) L_3(x_3) = L_2(a_2) Σ_{x_0} L_0(x_0) Σ_{x_t} L_t(x_t) Σ_{x_1} L_1(x_1) Σ_{x_3} L_3(x_3).

Note that x_1 is constrained by x_t and a_2, and thus Σ_{x_1} L_1(x_1) is not equal to one, so that it cannot be cancelled. The same is then true of x_t, so that the total expression depends on L_t. ∎

[Figure 4.2]

Now let us prove property (iii), that diffusion occurs in the forward direction of time. To examine this issue we will need a model of a process in which diffusion can occur. The Ehrenfest urn model, though rather simple, is sufficient for our modest purpose.

The Ehrenfest urn model is a chamber with two compartments, separated by a slightly-permeable membrane. The chamber contains N particles, each of which is in one of the compartments. The particles are free to move within each compartment, and also have a small chance, at any time, of passing through the membrane into the other compartment. To keep things simple we will not consider the exact position and momentum of each particle, but only the compartment it occupies. Thus a state of the system specifies the compartment (A or B) of each of the N particles. Also, we will only consider the state of the system at integer times t=0, 1, 2, ... etc., so that if a particle penetrates the membrane twice in [2,3], for example, this motion will not show up in the history.
A history of the urn specifies the compartment occupied by each particle at all of the times {0, 1, 2, ..., n}. The dynamics of the urn are given by its law function, which is as follows. For a possible history x, let k be the total number of membrane crossings in x. For instance, if N=3, and particles 1, 2 and 3 pass through the membrane 5, 1 and 3 times (respectively) during x, then k=5+1+3=9. Let p be some small number, such as 0.01. Then

L_X(x) = p^k (1-p)^(nN-k).

Intuitively this means that each particle has chance p of crossing to the other compartment in each unit of time, and chance 1-p of staying where it is. It also says that the particles are stochastically independent. Since the quantity p is independent of time, the law function is time-translation invariant. Also, the time reverse of a compartment switch is a compartment switch, so the law function is time-reversal invariant as well.

What is diffusion? Diffusion is a phenomenon that appears at the macro level, so let us define some macrostates for the urn. A macrostate of the urn is a specification of the number of particles in each compartment. We can use the notation <a,b> to represent the macrostate in which there are a particles in compartment A and b in B. In this model, diffusion occurs when the numbers of particles in the two compartments become more nearly equal, as in the transition from <15,3> to <11,7>. A transition from <9,9> to <6,12>, on the other hand, is the opposite of diffusion.

As usual, I assume that the boundary condition is set at t=0. We are then required to prove that there is a temporal asymmetry in P_X as far as diffusion is concerned. Roughly speaking, the system is disposed to undergo diffusion in the forward direction of time, and not in the backward direction. Of course, for diffusion to occur it must be physically possible.
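The forward tendency toward equilibrium can be computed directly from the law function just given. The sketch below (my own illustration; the macrostate <15,3> and p = 0.01 are example numbers) calculates the exact expected macrostate after one forward step: since each particle crosses independently with chance p, about p·a particles leave A and p·b leave B, a net drift of p·(a−b) toward equilibrium:

```python
from math import comb

p = 0.01
a, b = 15, 3                       # macrostate <15,3>, far from equilibrium

def binom(n, k, q):
    """Chance that exactly k of n independent particles cross (chance q each)."""
    return comb(n, k) * q**k * (1 - q)**(n - k)

# Distribution of the next macrostate <a - i + j, b + i - j>, where i
# particles cross A -> B and j cross B -> A, each with chance p, as the
# law function L_X(x) = p^k (1-p)^(nN-k) asserts.
expected_a = sum((a - i + j) * binom(a, i, p) * binom(b, j, p)
                 for i in range(a + 1) for j in range(b + 1))

# Expected next macrostate is <a - p(a-b), b + p(a-b)>: a drift toward
# equilibrium whose size is p times the imbalance.
assert abs(expected_a - (a - p * (a - b))) < 1e-9
print(expected_a)
```

Repeating the step shows the imbalance shrinking by a factor of (1 − 2p) per unit of forward time, which is the diffusion the text goes on to establish.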
If the boundary condition is that the state at t=0 is <9,9>, then because this is already a state of equilibrium, or maximum diffusion, any further diffusion is impossible. We shall therefore assume that the boundary condition involves a macrostate like <15,3>, which is far from equilibrium.

An important feature of a macrostate is the number of states which are consistent with it. (We may, if we wish, regard a macrostate simply as the class of states which are consistent with it.) The macrostate <0,4>, for instance, has only one state, whereas the macrostate <1,3> is consistent with four states, as there are four particles which could be the one in compartment A. Let us call the logarithm of the cardinal number of a macrostate its entropy.[16] In general, for fixed N, the entropies of more diffused macrostates are much greater than those further from equilibrium.

The proof of forward diffusion relies on two facts: First, it uses the lemma that forward transition chances are just the law values, or sums of law values, for sub-histories in that time interval. Second, it uses the fact that diffusion involves an increase of entropy.

A helpful way to proceed is to consider chances of single transitions, i.e. forward transition chances of the form P_X(X(t+1)=x' | X(t)=x), and backward transition chances of the form P_X(X(t-1)=x' | X(t)=x). By the lemma, the forward transition chance is just the law value of the sub-history x_t with state x at t and x' at t+1. If this sub-history x_t contains j particle movements then it may be shown that L_t(x_t) = p^j (1-p)^(N-j). We then have that:

P_X(X(t+1)=x' | X(t)=x) = p^j (1-p)^(N-j).

[16] This quantity is not exactly the Boltzmann entropy, of course, but it is related to it.

Now we must consider how the value of this expression depends on the entropy change in the sub-history x_t. The answer is: not at all!
Since ℓ is tri, the time-reverse of x_t has exactly the same law value as x_t, and so P_X(X(t+1)=x' | X(t)=x) = P_X(X(t+1)=x | X(t)=x').[17] It may seem to follow from this that, in the interval [t, t+1], the entropy is just as likely to go down as up. This is not the case, however, for if x is a non-equilibrium state then there is simply a greater number of sub-histories with increasing entropy than with decreasing entropy.

This may be shown with a simple example. Suppose N=4, and the state x is <{0}, {1,2,3}>, i.e. particle #0 is in compartment A and particles #1, #2 and #3 are in compartment B. This is a non-equilibrium state, as the equilibrium states are all members of the macrostate <2,2>, and x is a member of <1,3>. If the next state x' is in <0,4> or <4,0> then entropy decreases. If x' is a member of <1,3> or <3,1> then the entropy is constant. If x' is in <2,2> then entropy increases. Here is a table of all the important histories with their law values (I neglect histories whose law values are less than p²(1-p)², since p ≈ 0, and so these have a negligible impact).

[17] Note that, since on this model the particles' momenta are ignored, each state is its own time reverse.

Table 4.1

From      | To        | Entropy | Law value
{0} {123} | {} {0123} | down    | p(1-p)³
{0} {123} | {0} {123} | same    | (1-p)⁴
{0} {123} | {012} {3} | same    | p²(1-p)²
{0} {123} | {013} {2} | same    | p²(1-p)²
{0} {123} | {023} {1} | same    | p²(1-p)²
{0} {123} | {01} {23} | up      | p(1-p)³
{0} {123} | {02} {13} | up      | p(1-p)³
{0} {123} | {03} {12} | up      | p(1-p)³

We see that there is one history with decreasing entropy, four with constant entropy and three with increasing entropy. Overall, it is most likely that the entropy is constant, but an increase is about three times as probable as a decrease. (With a larger value of N the chance of constant entropy is greatly reduced, of course.)

When N is large it is not difficult to see that, on each forward transition, about proportion p of A's particles will move to B, and about proportion p of B's particles will move to A.
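The tallies behind Table 4.1 can be cross-checked by brute force. The sketch below is my own check: it enumerates all 16 successor states of x = <{0},{1,2,3}> (not only the table's leading-order histories), groups the one-step transition chances by entropy change, and finds that the total chance of an entropy increase is exactly three times that of a decrease:

```python
from itertools import product
from math import comb, log

p = 0.01
N = 4
x = (0, 1, 1, 1)          # particle 0 in compartment A (0), particles 1-3 in B (1)

def size_A(state):
    return state.count(0)

# Entropy of a macrostate <a, b> is the log of the number of states
# consistent with it, i.e. log C(N, a).
def entropy(state):
    return log(comb(N, size_A(state)))

totals = {"down": 0.0, "constant": 0.0, "up": 0.0}
for nxt in product([0, 1], repeat=N):
    j = sum(s != t for s, t in zip(x, nxt))      # number of crossings
    chance = p**j * (1 - p)**(N - j)             # law value of the one-step history
    ds = entropy(nxt) - entropy(x)
    key = "up" if ds > 1e-12 else ("down" if ds < -1e-12 else "constant")
    totals[key] += chance

assert abs(sum(totals.values()) - 1.0) < 1e-12   # the 16 successors exhaust the options
assert abs(totals["up"] - 3 * totals["down"]) < 1e-12
print(totals)
```

The exact factor of three arises because, from <1,3>, there are three single crossings that raise the entropy for every one that lowers it, and the same 3:1 ratio recurs among the triple crossings.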
Thus the overall flow will be about proportion p of the difference between the numbers in A and B. Now, this situation holds for every forward transition, so that in the history as a whole there will probably be a gradual evolution toward equilibrium in forward time.

Can this same reasoning be applied to the backward transition chances? Table 4.1 could certainly be switched around without any change in the law values, as the law function is time-reversal invariant. The argument stops there, however, since the backward transition chances are not equal to the corresponding law values. The reasoning would apply, at least to a reasonable approximation, for times long after the boundary condition, but in this region the system probably spends most of its time at equilibrium anyway, so there is no temporal asymmetry of diffusion. The temporal asymmetry of diffusion is a phenomenon that only exists near the boundary condition.

It should be stressed that this reasoning also applies to the time interval t<0, if the system exists there. Thus, for negative times, the entropy will "increase into the past", i.e. actually decrease over the interval [-τ, 0]. This emphasises the fact that the temporal asymmetry brought about by the boundary condition is local rather than global. In the interval [-τ, 0], the forward direction of time (i.e. the direction pointing away from the boundary condition) is the negative direction, towards -τ.

We finally come to the fourth temporal asymmetry of chance, which is Reichenbach's Common Cause Principle (1963: 157-63). This Principle (abbreviated as 'CCP') has been interpreted and used in a number of different ways, from being a way of defining causation to providing an argument for scientific realism. I will treat it as describing an empirically attested physical fact, which requires a physical explanation. This is basically the way Reichenbach originally viewed it, I think.
Its truth as a physical claim has been questioned by van Fraassen (1980: 28-31), particularly on the basis of observed correlations violating the Bell inequality. I agree that CCP is not universally valid, but it does hold within CSM. The discrepancy here is due to CSM3 which, though not universally valid, holds as a good approximation for "large" systems. Van Fraassen also claims that CCP will fail in "almost any indeterministic theory of sufficient complexity" (1980: 29), but this is doubtful, since CCP is true within CSM, which allows for highly complex systems.

CCP is easily derived within CSM, unlike in any other formalism I know of. It arises automatically, without any need for special, ad hoc assumptions. Before we show this, however, we should say what CCP is.

It frequently occurs that systems which do not directly interact, at least not significantly, have actual histories which are frequency correlated. Moreover, I think, two such non-interacting systems X and Y may be correlated with respect to chance, i.e. P_Z(X∈B | σ(Y)) ≠ P_Z(X∈B) (on a set of non-zero probability), for some Borel set B. A simple example of this is two properly functioning barometers in the same town. They will vary in sympathy, even though they do not interact with each other.

In such cases of correlation without interaction, CCP says that there is always some other system, say W, which does interact with both X and Y. In our barometer example, the third stochastic process is the air pressure, of course. There could be many systems like W which interact with both X and Y, but let us assume that, in this case, W is the only one.[18] CCP says that if we condition on the state of the common cause, then the correlation disappears, so we have

P_Z(X∈B_1 & Y∈B_2 | σ(W)) = P_Z(X∈B_1 | σ(W)) P_Z(Y∈B_2 | σ(W)), a.s.

[18] We could always let W represent all these processes lumped together, if there were more than one.

CCP involves a temporal asymmetry. To see this, let us suppose that W interacts with X and Y only in some bounded interval of time T_t.
CCP then says that the X-Y correlation may exist after T_t, but not before. In other words, it may be that

P_Z(X_t∈B | σ(Y_t)) ≠ P_Z(X_t∈B),

but only if t is after T_t. Before the interaction there can be no correlation at all. When Reichenbach says that conjunctive forks are always closed to the past, but usually open to the future, this is basically what he means. The Common Cause Principle can therefore be expressed as the following four claims:

(I) If X and Y cannot interact, and there is no "common cause" W, then X and Y are uncorrelated.
(II) Where there is such a system W which may interact with them both, X and Y can be correlated.
(III) In such cases of correlation, the correlation disappears when we condition on (the sigma field generated by) W.
(IV) If the interaction is restricted to some bounded interval of time, the correlation only exists after the interaction.

The first claim is just Theorem 4.4.1. (II) and (III) are easily deduced from the following theorem.

Theorem Let Z = <W,X,Y>. If bc_Z entails that X and Y cannot directly interact, then P_Z(x & y | w) = P_Z(x | w) P_Z(y | w).

Proof: Using Axiom 4 we have that P_Z(x & y | w) = P_Z(x | w) P_Z(y | w & x). By the causal theory of chance, this may be written as P_r(x | l_Z & bc_Z & w) P_r(y | l_Z & bc_Z & w & x). The question here is whether x is relevant to y within the epistemic state K = l_Z & bc_Z & w. Now, according to l_Z & bc_Z, X may interact with W, and W with Y, but X cannot interact directly with Y. If we let I<W,X> stand for "W may interact with X", and so on, then the bridge between X and Y provided by l_Z & bc_Z is the conjunction I<W,X> & I<W,Y>. The proposition W=w, however, gives complete information about the actual history of W, so using Corollary 4.3.10 the conjunctions I<W,X> & w and I<W,Y> & w both factorise. Then, by Theorem 4.3.5, x is not relevant to y within the epistemic state K = l_Z & bc_Z & w.
The result is immediate. ∎

Corollary If X and Y can each interact with W, though not with each other, then X and Y may be correlated.

Proof: Since X and Y can each interact with W, P_Z(x | w) and P_Z(y | w) can vary with w. Thus, since

P_Z(x & y) = Σ_w P_Z(x & y | w) P_Z(w) = Σ_w P_Z(x | w) P_Z(y | w) P_Z(w),

and neither term P_Z(x | w) nor P_Z(y | w) can be taken out of the sum as a constant, it may be that P_Z(x & y) ≠ P_Z(x) P_Z(y). ∎

Note that to prove the existence, rather than the mere possibility, of a correlation would require consideration of a particular system, with specific dynamical properties. We have thus proved both (II) and (III). The final claim is immediate, provided that we again interpret "before" and "after" in terms of closer to, and further from, the boundary condition. For, by Theorem 4.7.3.4, at times before the interaction P_X is exactly what it would be if there were no interaction with W, and the same is true of P_Y. But, by Theorem 4.4.1, where there is no interaction there is no correlation. It follows then that there is no correlation between X and Y before the interaction, which is the result.

5. Correlation

The topic of probabilistic correlation has become important, and puzzling, due to its central role in quantum theory. Einstein, Podolsky and Rosen (1935) showed that the rules of quantum mechanics predict correlations between measurement results which seem to require "hidden variables", i.e. physical quantities not represented in the quantum wave function. The hope of explaining these correlations using local hidden variables seemed to be quashed decisively by two results, however, one mathematical and one experimental. The mathematical result was Bell's theorem (Bell, 1964), which showed, within certain assumptions, that any local hidden variable theory must differ from quantum mechanics in some of its empirical predictions.
The empirical result consisted of a number of different experiments to test the predictions of hidden-variable theories against those of quantum mechanics (QM), the best of which is considered to be that of Aspect et al. (1982). The verdict of these experiments, at least from 1980 onwards, is unanimous in favour of QM, and against hidden variable theories of the type Bell considered.

The correlations discovered theoretically by Einstein, Podolsky and Rosen (hereafter EPR), and experimentally confirmed by Aspect, are therefore a source of controversy at the present time. They have given rise to an extraordinary range of theories attempting to account for them. In this chapter we shall discuss the most promising explanations offered to date, and find them all unsatisfactory to various degrees. We shall also see that the EPR correlations are provably impossible within CSM, which means that CSM cannot be generally valid; at best, it holds within some special case. It will be argued in §5.3 that the postulate CSM3, though approximately true of "large" systems, is false in general, and shown that when this assumption is relaxed the inconsistency between CSM and QM disappears. This result shows that a kind of "hidden variable" explanation of EPR correlations is possible, within a stochastic framework. The hidden variables are not, of course, of the type Bell considered, nor are they non-local as in Bohm's theory (Bohm, 1952).

5.1 Classical and Quantum Correlation

The EPR correlations are correlations with respect to chance, but we shall see that they are quite different from the chance correlations allowed for within CSM. In a sense defined below, the correlations of CSM are classical, whereas the EPR correlations are not classical.

5.1.1 Classical Chance Correlations

Let us review the results obtained in Chapter 4 about chance correlations. First, systems which are necessarily causally independent at all times are uncorrelated (Thm. 4.4.3). Second, if systems X and Y possibly interact within the time interval T_t, then X and Y are correlated after T_t but not before. What does it mean, exactly, to say that "X and Y are correlated after T_t"? Here is a suitable definition.

Definition X and Y are chance correlated at time t just in case the random variables X(t) and Y(t) are correlated w.r.t. chance.

By "chance" here I mean the chance function P_Z, where Z = <X,Y>. In Definition 4.6.8 I also defined a time-dependent chance function P_t, which is intuitively the chance at time t. One might wonder, therefore, which events are correlated with respect to P_t, and for which values of t. We find that, within CSM, time-slice subsystems are only correlated with each other if those time slices themselves interact. We have, in other words, the following theorem.

Theorem Suppose X, Y cannot interact outside T_t. Then, for t* > t > T_t, X(t*) and Y(t*) are uncorrelated with respect to P_t.

Proof: The time slices of X, Y in [t, τ] do not interact, so we can apply Thm. 4.4.3. ∎
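The classical pattern, correlation that vanishes on conditioning, is easy to exhibit numerically. In this sketch (a toy discrete model of my own, in the spirit of the barometer example; all the probability values are illustrative assumptions), X and Y each depend stochastically on a common cause W but not on each other. They are correlated outright, yet independent given W:

```python
from itertools import product

# Common cause W (think: air pressure, high/low); X and Y are two
# barometer readings, each a noisy function of W alone.
P_w = {0: 0.5, 1: 0.5}
P_x_given_w = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_y_given_w = {0: {0: 0.85, 1: 0.15}, 1: {0: 0.1, 1: 0.9}}

def joint(w, x, y):
    # Conjunctive fork: P(w & x & y) = P(w) P(x|w) P(y|w)
    return P_w[w] * P_x_given_w[w][x] * P_y_given_w[w][y]

P_xy = {(x, y): sum(joint(w, x, y) for w in P_w) for x, y in product([0, 1], repeat=2)}
P_x = {x: P_xy[(x, 0)] + P_xy[(x, 1)] for x in [0, 1]}
P_y = {y: P_xy[(0, y)] + P_xy[(1, y)] for y in [0, 1]}

# Unconditionally, X and Y are correlated...
assert abs(P_xy[(0, 0)] - P_x[0] * P_y[0]) > 0.01

# ...but conditioning on the common cause screens the correlation off:
for w, x, y in product([0, 1], repeat=3):
    lhs = joint(w, x, y) / P_w[w]
    rhs = P_x_given_w[w][x] * P_y_given_w[w][y]
    assert abs(lhs - rhs) < 1e-12
print("correlated overall, independent given W")
```

This is exactly the behaviour that, in the text's terminology, makes a correlation classical, and it is the behaviour the EPR correlations fail to show.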
Second, if systems X and Y possibly interact within the time interval Tt, then X and Y are correlated after T, but not before. What does it mean, exactly, to say that "X and Y are correlated after 77'? Here is a suitable definition. Definition X and Y are chance correlated at time t just in case the random variables X(f) and Y(t) are correlated w.r.t. chance. By "chance" here I mean the chance function Pz, where Z=<X,Y>. In Definition 4.6.8 I also defined a time-dependent chance function Pt, which is intuitively the chance at time t. One might wonder therefore which events are correlated with respect to Pt, and for which values of t. We find that, within CSM, time-slice subsystems are only correlated with each other if those time slices themselves interact. We have, in other words, the following theorem. Theorem Suppose X, Y cannot interact outside Tt. Then, for t* > t > Tt, X(t*) and Y(r*) are uncorrelated with respect to Pt. Proof: The time slices of X, Y in [t, T] do not interact, so we can apply Thm. 4 . 4 . 3 . • 183 Since correlations within CSM all have this property, we shall call such correlations classical correlations. Definition Suppose X and Y may interact within Tt only. Then they are classically correlated (w.r.t. Pz) iff X and Y are correlated w.r.t. Pz, but uncorrelated w.r.t. P(, where t is any time after T{. One may be tempted to think that all chance correlations in nature are classical, but this is not so. We shall see that the formalism of quantum mechanics permits correlations which are not classical. 5.1.2 The Rules of Quantum Mechanics Quantum mechanics (QM) is a stochastic mechanical formalism, like CSM, but is set up rather differently. In this section I shall describe those of its properties which I need to appeal to later in the chapter. In QM a system X has a set of possible states, as in CSM. Each possible state is represented by a mathematical structure known as a wavefunction \\f. 
The idea that material particles such as electrons are associated with waves was first proposed by De Broglie (1924), and Schrödinger (1926) discovered a linear, deterministic equation which seemed to govern these "matter waves". The wavefunction for an electron, say, is a complex-valued function whose domain is the set of spacetime points, if we ignore electron spin. Schrödinger hoped that by means of this wavefunction, which at first he considered to represent some straightforward physical property, the discontinuities of Bohr's "old" quantum theory could be eliminated. Schrödinger was able, for instance, to derive the Rydberg formula for the spectral lines of the hydrogen atom from his wave equation. The attempt to remove discontinuous changes from quantum theory failed, however, since Born showed that the intensity of the wavefunction has to be interpreted as a probability. Roughly speaking, for any position x, the intensity of the wavefunction at x, i.e. |ψ(x)|², is the probability that the electron's position, if measured, will be found to be x. Thus, upon measurement, the state of a system evolves stochastically, rather than in accordance with Schrödinger's wave equation. For a single particle, the domain of the wavefunction is just R³, which is the position space (space of possible positions) of a single particle. For a pair of particles, however, the wavefunction is defined on R⁶, as the two particles have six degrees of freedom in their positions. In general we might say that the domain of the wavefunction is the configuration space of the system concerned, although we must remember that the configuration space of a system in quantum mechanics is generally only half the size of the phase space of its classical counterpart. In classical mechanics, for example, the state of a particle is specified by its momentum as well as its position, a total of six numbers. To describe the formalism as simply as possible we shall use the vector space notation.
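Born's rule, as just described, can be sketched numerically. The amplitudes below are hypothetical values of my own choosing for a discretised one-dimensional wavefunction; the point is only that the normalised intensities |ψ(x)|² behave as measurement probabilities.

```python
# A minimal sketch of Born's rule on a discrete grid of positions:
# the normalised intensity |psi(x)|^2 gives the chance that a position
# measurement finds the particle at x.
import random

# hypothetical unnormalised complex amplitudes at five grid positions
psi = [0.2 + 0.1j, 0.5 - 0.2j, 1.0 + 0.0j, 0.5 + 0.2j, 0.2 - 0.1j]

norm_sq = sum(abs(a) ** 2 for a in psi)
probs = [abs(a) ** 2 / norm_sq for a in psi]  # Born probabilities

# simulate one position measurement by sampling with these weights
random.seed(1)
outcome = random.choices(range(len(psi)), weights=probs)[0]
print(probs, outcome)
```

The probabilities sum to one by construction, and the most intense grid point (the middle one here) is the most likely measurement outcome.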
A crucial fact about the set H of possible wavefunctions for a system is that they form a vector space. That is, if ψ₁, ψ₂ ∈ H, then ψ₁ + ψ₂ ∈ H, and if ψ ∈ H then cψ ∈ H, where c is any complex number. We can therefore consider each wavefunction in H to be a vector; indeed, they are known as state vectors. In future we will denote ψ in the "ket" vector notation as |ψ⟩. We can define the scalar product of two ket vectors |ψ₁⟩ and |ψ₂⟩, written g(|ψ₁⟩, |ψ₂⟩), as follows:

g(|ψ₁⟩, |ψ₂⟩) = ∫ ψ₁(x)* ψ₂(x) dx,

where x ranges over the configuration space of ψ, and c* is the complex conjugate of c. It is possible to define a second vector space, in addition to the ket vector space, by considering linear functions from ket vectors to complex numbers. In other words, we can consider functions f from H to C such that f(|ψ₁⟩ + |ψ₂⟩) = f(|ψ₁⟩) + f(|ψ₂⟩). It is trivial to show that the class of such functions also forms a vector space. Moreover, for any ket vector |ψ⟩, we can associate with it a linear function f_ψ, using the metric g, as follows.

Definition   f_ψ is the unique linear function f such that f(|φ⟩) = g(|ψ⟩, |φ⟩), for all |φ⟩ ∈ H.

Since f_ψ is also a vector, we write it in the "bra" vector notation as ⟨ψ|. The space of linear functions on ket vectors is then known as the bra vector space. Using this bra-ket notation, the metric g(|ψ⟩, |φ⟩), which is equal to f_ψ(|φ⟩) by definition, is written ⟨ψ|φ⟩. Since g has the form of an inner product, we can define the length or norm of a state vector |ψ⟩ as ⟨ψ|ψ⟩, and speak of two state vectors |φ⟩ and |ψ⟩ as orthogonal when ⟨φ|ψ⟩ = 0. We will assume that all ket vectors |ψ⟩ are normalised, so that ⟨ψ|ψ⟩ = 1. The state vector |ψ⟩ evolves in accordance with Schrödinger's equation when the system is isolated. When a measurement occurs, however, the situation is quite different. Each type of measurement on the system is represented by a linear "operator" on H, that is, a function H → H.
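In finite dimensions the metric g and the bra ⟨ψ| can be sketched directly; the vectors below are illustrative choices of mine, not from the thesis. Note that numpy's `vdot` conjugates its first argument, matching the definition of g above.

```python
# Finite-dimensional stand-ins for wavefunctions: the metric g is the
# conjugate-linear inner product, and the bra <psi| is the linear function
# f_psi(phi) = g(|psi>, |phi>).
import numpy as np

psi1 = np.array([1 + 1j, 0, 1 - 1j]) / 2.0   # normalised: g(psi1, psi1) = 1
psi2 = np.array([0, 1, 0], dtype=complex)

def g(a, b):
    return np.vdot(a, b)       # conjugates a, i.e. sum_x a(x)* b(x)

def bra_psi1(phi):
    return g(psi1, phi)        # the linear function f_psi1, written <psi1|

# linearity of the bra: f(|a> + |b>) = f(|a>) + f(|b>)
phi_a = np.array([1, 2, 3], dtype=complex)
phi_b = np.array([0, 1j, 0])
print(np.isclose(bra_psi1(phi_a + phi_b), bra_psi1(phi_a) + bra_psi1(phi_b)))
print(np.isclose(g(psi1, psi1), 1.0))   # normalisation <psi1|psi1> = 1
print(np.isclose(g(psi1, psi2), 0.0))   # psi1 and psi2 are orthogonal
```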
Each such operator A has eigenstates, that is, states |ψ⟩ such that A|ψ⟩ = a|ψ⟩, where a is a (complex) scalar called the eigenvalue corresponding to the eigenstate |ψ⟩. A linear operator A is called Hermitian if g(|φ⟩, A|ψ⟩) = g(A|φ⟩, |ψ⟩). Hermitian operators have two important properties: (i) their eigenvalues are real, i.e. their imaginary part is zero, and (ii) eigenstates corresponding to distinct eigenvalues are orthogonal. The rule for measurement in quantum mechanics, known as the Projection Postulate, can now be expressed as follows. If a measurement is made which corresponds to a Hermitian operator A, then the possible states of the system after the measurement are just the eigenstates of A, which we can write as |a₁⟩, |a₂⟩, ..., |aᵢ⟩, ..., regardless of the original state |ψ⟩ of the system. The "probability" of each eigenstate |aᵢ⟩ is the squared scalar product, or projection, |⟨aᵢ|ψ⟩|². It seems that "probability" here means the "up to date" (utd) chance P_t, where t is any time just before the measurement occurs. For each possible final state |aᵢ⟩ the outcome of the measurement, i.e. the measured value for the quantity associated with A, is just aᵢ, the eigenvalue associated with |aᵢ⟩. (This presentation of QM is partially based upon that of William Unruh (1994).) There are operators for each classical property of the system, such as position, momentum, energy, angular momentum, and so on, as well as some new properties such as intrinsic spin. In general, however, a state vector does not contain a definite value for each of these properties. To say that a system possesses a definite value a for quantity A means, at the least, that a measurement of A would necessarily yield the value a; yet for a quantum state |ψ⟩ this holds only in the rare case that |ψ⟩ is an eigenstate of A. The usual situation is that |ψ⟩ is a linear combination, or superposition, of A's eigenstates, i.e.
|ψ⟩ = c₁|a₁⟩ + c₂|a₂⟩ + ... + cᵢ|aᵢ⟩ + ..., where Σᵢ |cᵢ|² = 1. If the property A is measured on a system in such a superposition with respect to A, the wavefunction jumps to one of the eigenstates |aᵢ⟩, as stated above. This change is sometimes called the "collapse" of the wavefunction. For two Hermitian operators A and B we define the commutator of A and B, written [A,B], as AB − BA. If [A,B] = 0, i.e. [A,B]|ψ⟩ = 0 for all ψ ∈ H, then the operators A and B are said to commute.
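The machinery above can be checked in a two-dimensional example. The operator A below is a Hermitian matrix of my own choosing, used only to illustrate the stated properties: real eigenvalues, orthonormal eigenstates, Born chances |⟨aᵢ|ψ⟩|² that sum to one, and a pair of non-commuting operators (the Pauli spin matrices).

```python
# Hermitian operators, the Projection Postulate, and the commutator,
# sketched with 2x2 matrices.
import numpy as np

A = np.array([[2, 1 - 1j],
              [1 + 1j, 3]])              # equal to its conjugate transpose
assert np.allclose(A, A.conj().T)        # i.e. Hermitian

eigvals, eigvecs = np.linalg.eigh(A)     # eigh is specialised to Hermitian matrices
assert np.allclose(eigvals.imag, 0.0)    # (i) eigenvalues are real
assert np.allclose(eigvecs.conj().T @ eigvecs, np.eye(2))  # (ii) orthonormal eigenstates

psi = np.array([1.0, 1.0j]) / np.sqrt(2)      # a normalised superposition
probs = np.abs(eigvecs.conj().T @ psi) ** 2   # Born chances |<a_i|psi>|^2
print(probs.sum())                            # the chances sum to 1

def commutator(X, Y):
    return X @ Y - Y @ X                 # [X,Y] = XY - YX

sx = np.array([[0, 1], [1, 0]], dtype=complex)   # Pauli spin matrices
sy = np.array([[0, -1j], [1j, 0]])
print(np.allclose(commutator(sx, sy), 2j * np.array([[1, 0], [0, -1]])))  # [sx,sy] = 2i*sz
print(np.allclose(commutator(A, A), 0))  # every operator commutes with itself
```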