Classical Causal Models for Bell and Kochen-Specker Inequality Violations Require Fine-Tuning

Nonlocality and contextuality are at the root of conceptual puzzles in quantum mechanics, and are key resources for quantum advantage in information-processing tasks. Bell nonlocality is best understood as the incompatibility between quantum correlations and the classical theory of causality, applied to relativistic causal structure. Contextuality, on the other hand, is on a more controversial foundation. In this work, I provide a common conceptual ground between nonlocality and contextuality as violations of classical causality. First, I show that Bell inequalities can be derived solely from the assumptions of no-signalling and no-fine-tuning of the causal model. This removes two extra assumptions from a recent result from Wood and Spekkens, and remarkably, does not require any assumption related to independence of measurement settings -- unlike all other derivations of Bell inequalities. I then introduce a formalism to represent contextuality scenarios within causal models and show that all classical causal models for violations of a Kochen-Specker inequality require fine-tuning. Thus the quantum violation of classical causality goes beyond the case of space-like separated systems, and manifests already in scenarios involving single systems.


I. INTRODUCTION
Quantum contextuality, the phenomenon uncovered by Kochen and Specker (KS) [1], is at the core of the quantum departure from classicality, and has recently been identified as a candidate for the resource behind the power of quantum computation [2]. Much controversy still exists, however, on what exactly contextuality is, with different formalisms giving different definitions of the phenomenon [3][4][5][6]. For example, derivations following the work of Kochen-Specker require an assumption of outcome determinism, the validity of which in experimentally relevant situations has been criticised [3,7]. Indeed, it has been argued that it is not possible to experimentally test contextuality without extra assumptions [8].
Bell nonlocality [9] rests on comparatively solid foundations. It is best understood as the incompatibility between quantum correlations and causal constraints [10]. A modern approach is to capture these constraints within the framework of causal networks [11], where causal structure is represented as a directed acyclic graph (DAG) (Fig. 1). Assuming a particular causal graph (in the Bell case, motivated by relativity) implies constraints on observable probability distributions, such as Bell inequalities. A violation of a Bell inequality can thus be understood as a violation of one or more of the assumptions underlying that framework, such as Reichenbach's principle of common cause [12,13].
The causal network formalism was brought to the attention of the quantum foundations community in an influential work by Wood and Spekkens [14], who showed that Bell inequalities can be derived from a core principle of that framework: no-fine-tuning. That is, every causal model that allows for certain Bell inequality violations requires causal connections not observed in the phenomena, such as faster-than-light signalling. Only with special, finely-tuned parameters can a causal model "hide" those connections from observers. This provides a novel way of looking at Bell inequalities: not as implications of relativistic causal structure, but as implications of classical causal principles for any causal structure. That result, however, applied only to phenomena satisfying two extra assumptions, which makes it inapplicable to contextuality scenarios. Here I will prove a more general result without those extra conditions.
* e.cavalcanti@griffith.edu.au
The framework of causal networks has motivated several new directions for the study of quantum causality. One promising programme is to extend the classical causal formalism to a framework of quantum causal models [13,15-19], opening the exciting prospect of a coherent understanding of the nature of causality in a quantum world, and a resolution of at least part of the puzzle of Bell's theorem [20].
Contextuality, on the other hand, is a priori unrelated to causality: it is not necessary that measurements are space-like separated, or that they involve separate subsystems at all. Thus it is not clear that a theory of quantum causality could help with contextuality. Here I bridge that gap and show that fine-tuning is required in all causal models that reproduce the violation of KS-contextuality inequalities. This unifies Bell-nonlocality and KS-contextuality as violations of classical causality, and opens a new direction to study contextuality as a resource.
This work is organised as follows. First I review the framework of causal models, and how it can be used to derive Bell inequalities by assuming free choice and relativistic causality. I then review how [14] derives Bell inequalities from no-signalling, no-fine-tuning, marginal setting independence, and local setting dependence. I will then show that, remarkably, Bell inequalities can be derived from no-signalling and no-fine-tuning alone. This can be seen as a corollary of a more general result: KS-inequalities can be derived from no-disturbance and no-fine-tuning. In brief, dropping the assumption of marginal setting independence allows for choices of settings to be correlated, as they typically are in a contextuality test; the assumption of no-disturbance (which reduces to no-signalling in Bell scenarios) is implied by the compatibility of observables; and in a general contextuality scenario, the factorisability condition derived in the proof of the previous result now implies KS-inequalities rather than Bell inequalities. I conclude with a discussion of the relevance of this result, its drawbacks, and opportunities for future research.

II. CAUSAL MODELS
A modern framework for causation and its role in explaining correlations can be found in the theory of causal networks [11]. With extensive applicability from statistics to epidemiology, economics, and artificial intelligence, it has been developed as a tool to connect causal inferences and probabilistic observations. In such a model, causal structure is represented as a graph G, with variables as nodes and direct causal links as directed edges (arrows) between nodes. To avoid the potential for paradoxical causal loops, closed cycles are forbidden, and the resulting structure is that of a directed acyclic graph (DAG) (Fig. 1). The relations between nodes in a DAG G can be expressed in an intuitive genealogical terminology: nodes pointing to a given node X (the direct causes of X) are called the parents of X, denoted Pa(X); the ancestors of X, An(X), are all nodes from which there is a directed path to X (i.e. all variables in the causal past of X); the descendants of X, De(X), are all nodes for which X is an ancestor (i.e. all variables in the causal future of X); the set of non-descendants of X is denoted Nd(X).
The purpose of a DAG is to encode the conditional independences associated with any probability distribution compatible with the causal structure, through the Causal Markov Condition: in any probability distribution P that is compatible with a graph G, a variable X is independent of all its non-descendants, conditional on its parents. That is, $P(X \mid Nd(X), Pa(X)) = P(X \mid Pa(X))$, which we denote as $(X \perp\!\!\!\perp Nd(X) \mid Pa(X))$. The Causal Markov Condition is equivalent to the requirement that any distribution over the variables $X_1, \ldots, X_n$ compatible with the graph G factorises as
$$P(X_1, \ldots, X_n) = \prod_j P(X_j \mid Pa(X_j)). \qquad (1)$$
Those conditional independences can be obtained from the graph through a rule called d-separation [11].
Two sets of variables X and Y are d-separated given a set of variables Z (denoted $(X \perp\!\!\!\perp Y \mid Z)_d$) if and only if Z "blocks" all paths p from X to Y. A path p is blocked by Z if and only if (i) it contains a chain A → B → C or a fork A ← B → C such that the middle node B is in Z, or (ii) it contains an inverted fork (head-to-head) A → B ← C such that the node B is not in Z, and there is no directed path from B to any member of Z.
D-separation is a sound and complete criterion for conditional independence: if in a DAG G two variables X and Y are d-separated given Z, $(X \perp\!\!\!\perp Y \mid Z)_d$, then they are conditionally independent given Z, $(X \perp\!\!\!\perp Y \mid Z)$, in all distributions compatible with G; and if the conditional independence $(X \perp\!\!\!\perp Y \mid Z)$ holds for all distributions compatible with G, then G satisfies $(X \perp\!\!\!\perp Y \mid Z)_d$.
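As an illustration of the d-separation rule, the following is a minimal Python sketch, not part of the original paper. Instead of tracing paths, it uses the equivalent moralised-ancestral-graph criterion: restrict the DAG to the ancestors of the variables involved, connect every pair of parents that share a child, drop edge directions, remove Z, and test whether X can still reach Y. The function names and the example graph `bell_dag` (with the latent variable written `L`) are hypothetical choices made for this sketch.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen

def d_separated(parents, xs, ys, zs):
    """Test (xs _||_ ys | zs)_d in a DAG given as {node: set of parents}."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)
    # Moralise the ancestral subgraph: undirected parent-child edges,
    # plus an edge between every pair of parents of a common child.
    adj = {v: set() for v in keep}
    for child in keep:
        pas = parents.get(child, ())
        for p in pas:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pas, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Delete zs and check whether any node of xs can still reach ys.
    seen, stack = set(), [v for v in xs if v not in zs]
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - zs - seen)
    return True

# Hypothetical example: settings X, Y, outcomes A, B, latent common cause L.
bell_dag = {"A": {"X", "L"}, "B": {"Y", "L"}, "X": set(), "Y": set(), "L": set()}
```

For instance, `d_separated(bell_dag, {"A"}, {"Y"}, {"X"})` holds in this graph, while A and B are d-separated only once the latent variable is included in the conditioning set.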

A. Causal models and Bell's theorem
As an example of the application of this framework, we review how it can be used to derive Bell's theorem. Consider the correlations between measurements performed by two agents, Alice and Bob. Alice's choice of measurement is represented by a variable X, and Bob's by a variable Y . Their respective outcomes are represented by A and B. Their measurements are assumed to be performed within space-like separated regions, so that no relativistic causal connection can exist between the variables in Alice's lab and those in Bob's lab. They may however be correlated due to variables in their common causal past, the set of which is denoted by Λ. The assumption that Alice and Bob can make "free choices" is translated as the requirement that X and Y are exogenous variables: they have no relevant causes. This scenario is represented in the graphical notation as in Fig. 2.

The Causal Markov Condition then implies that $P(AB\Lambda \mid XY) = P(A \mid X\Lambda)\,P(B \mid Y\Lambda)\,P(\Lambda)$. Averaging over Λ we obtain the factorisability condition of a local hidden variable model:
$$P(AB \mid XY) = \sum_\Lambda P(\Lambda)\,P(A \mid X\Lambda)\,P(B \mid Y\Lambda).$$
As is well known, this leads to the Bell inequalities, which can be violated by quantum correlations [21]. Therefore, assuming relativistic causal structure and free choices, quantum correlations cannot be reproduced by the classical framework of causality.
Note that the assumption of "free choice" is not strictly necessary: a weaker but sufficient condition is simply "Λ-independence", that the measurement choices are independent of any latent variables Λ that are causally connected with the systems, $(\Lambda \perp\!\!\!\perp XY)$. This would still be compatible with a causal graph where there is a common cause between the measurement choices X and Y. But without one of these assumptions, it would be possible for a conspiratorial "superdeterministic" theory to reproduce the quantum correlations. Remarkably, in the main result of this paper neither of these assumptions is needed. Superdeterministic theories, for example, are ruled out because they violate no-fine-tuning.
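To make the step from factorisability to a Bell inequality concrete, here is a small Python sketch. It is an illustration only, and everything specific in it is an assumption not taken from the text: the choice of the CHSH expression as the example inequality, outcomes labelled ±1, the singlet-state correlator E = -cos(θ_x - θ_y), and the measurement angles. Since a factorisable model is an average over Λ of products of local response functions, its CHSH value cannot exceed the largest value attained by a deterministic local strategy, which the code finds by enumeration.

```python
from itertools import product
from math import cos, pi, sqrt

def chsh(E):
    """CHSH expression from correlators E[x, y] = <A B> for settings x, y."""
    return E[0, 0] + E[0, 1] + E[1, 0] - E[1, 1]

# Deterministic local strategies: a = f(x), b = g(y), with outcomes +1/-1.
local_values = []
for a0, a1, b0, b1 in product((+1, -1), repeat=4):
    a, b = (a0, a1), (b0, b1)
    E = {(x, y): a[x] * b[y] for x in (0, 1) for y in (0, 1)}
    local_values.append(chsh(E))
local_max = max(local_values)

# Assumed quantum example: singlet correlators at textbook angles.
alice = (0.0, pi / 2)
bob = (pi / 4, -pi / 4)
E_q = {(x, y): -cos(alice[x] - bob[y]) for x in (0, 1) for y in (0, 1)}
quantum_value = abs(chsh(E_q))
```

Under these assumptions `local_max` is 2, whereas `quantum_value` equals 2√2 up to floating-point error, so no factorisable model reproduces these correlators.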

III. MEASUREMENT SCENARIOS AND CAUSAL MODELS
Traditionally, locality and noncontextuality have been spelled out in terms of ontological models, but here I translate those concepts into the language of causal models. Indeed, we can think of ontological models as causal models in disguise, which is a useful perspective as it allows one to more clearly identify implicit classical causal assumptions that may be revised in light of quantum causal models. The formalism used here is most analogous to that of Abramsky and Brandenburger [4], although it is expressed in the language of causal models, and uses simplified terminology (for example, I do not need to refer to sheaf theory).
A measurement scenario is specified by: i) a set M of measurements; ii) for each measurement m ∈ M, a set of possible outcomes $O_m$; iii) a compatibility structure C on M (a family of subsets of M) specifying joint measurability. Two measurements m, n ∈ M are said to be jointly measurable, compatible, or to be part of a context, iff {m, n} ∈ C, and likewise for sets of more than two measurements [22]. Without loss of generality, we can enlarge each $O_m$ so that all measurements have the same number of outcomes, and label them so that all outcome sets are equal and denoted simply by O.
Consider now an individual test within a measurement scenario, where a set of n random variables $X_1, \ldots, X_n$ specifies n measurements to be performed upon the system. A contextuality scenario is one in which, for any given run, the chosen measurements $x_1, \ldots, x_n$ are jointly measurable. That is, let $X_i = x_i \in M$ denote the values of those variables in a particular run. Then $\{x_1, \ldots, x_n\} \in C$. This could be done, for example, through a further random variable that selects a context, from which $X_1, \ldots, X_n$ are determined, but we make no assumption about the process of selection or the order in which the measurements are performed. A special class of contextuality scenarios is that in which M can be decomposed into k subsets $M_1, M_2, \ldots, M_k$ such that all contexts c ∈ C have at most one element from each subset. These are called (k-partite) Bell-nonlocality scenarios.
From here on we consider scenarios containing only pairs of compatible measurements. This doesn't necessarily mean that there are no sets of three or more compatible measurements that can be performed on the system, but only that we are restricting M so that it only contains pairs of them. These will be called binary contextuality scenarios, a special case of which are bipartite Bell scenarios. The measurements will be chosen through random variables X and Y , with outcomes respectively recorded by random variables A and B.
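The distinction between a general binary contextuality scenario and a bipartite Bell scenario can be sketched in code. The snippet below is a hypothetical illustration, not the paper's formalism: it represents the compatibility structure as a list of pairs and tries to split M into two subsets so that every context takes one measurement from each, which amounts to 2-colouring the compatibility graph. The function name, the measurement labels and the two example scenarios (a four-measurement Bell-type scenario and a five-measurement cycle) are assumptions made for the example.

```python
from collections import deque

def bell_partition(measurements, contexts):
    """Try to split M into two parties so that each context {m, n}
    takes one measurement from each; return None if impossible."""
    adj = {m: set() for m in measurements}
    for m, n in contexts:
        adj[m].add(n)
        adj[n].add(m)
    side = {}
    for start in measurements:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            m = queue.popleft()
            for n in adj[m]:
                if n not in side:
                    side[n] = 1 - side[m]   # compatible partner goes to the other party
                    queue.append(n)
                elif side[n] == side[m]:
                    return None             # odd cycle: no bipartition exists
    return ({m for m in measurements if side[m] == 0},
            {m for m in measurements if side[m] == 1})

# Hypothetical scenarios: two settings per party, and a cycle of five measurements.
bell_M = ["x0", "x1", "y0", "y1"]
bell_C = [("x0", "y0"), ("x0", "y1"), ("x1", "y0"), ("x1", "y1")]
cycle_M = list(range(5))
cycle_C = [(i, (i + 1) % 5) for i in range(5)]
```

Here `bell_partition(bell_M, bell_C)` recovers the two parties, while the five-cycle admits no such split and is therefore a binary contextuality scenario that is not a bipartite Bell scenario.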
A phenomenon P for such a scenario is specified by a probability distribution P(ABXY) for the observable variables. A special class of phenomena are those where the probability for the outcome of one measurement does not depend on which other measurement is performed on the system, i.e. those that satisfy the property of no-disturbance.

Definition 1 (No-disturbance).
A phenomenon is said to satisfy no-disturbance iff $P(A \mid XY) = P(A \mid X)$ and $P(B \mid XY) = P(B \mid Y)$ for all values of the variables A, B, X, Y for which those conditionals are defined.
In the causal-model notation, the no-disturbance conditions are denoted by $(A \perp\!\!\!\perp Y \mid X)$ and $(B \perp\!\!\!\perp X \mid Y)$. Contextuality scenarios naturally satisfy those conditions, since X and Y can only take joint values as pairs of compatible measurements, and the no-disturbance condition is implicit in the very meaning of compatibility [4]. In Bell scenarios this assumption is called no-signalling, and where the measurements X and Y are performed in space-like separated regions, it is justified by relativity.
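Definition 1 can be checked mechanically on a table of conditional probabilities. The following Python sketch is illustrative only: the data layout (a dictionary from measured pairs (x, y) to outcome distributions), the function names and the two example behaviours are assumptions of this sketch. It tests that the marginal of A is the same in every context containing x, and likewise for B, restricted to the pairs for which the conditionals are defined.

```python
def marginal(p_joint, which):
    """Marginal of A (which=0) or B (which=1) from {(a, b): prob}."""
    out = {}
    for outcome, prob in p_joint.items():
        out[outcome[which]] = out.get(outcome[which], 0.0) + prob
    return out

def no_disturbance(behaviour, tol=1e-9):
    """behaviour maps each measured pair (x, y) to {(a, b): P(ab|xy)}.
    Checks P(A|XY) = P(A|X) and P(B|XY) = P(B|Y) on the defined pairs."""
    ref_a, ref_b = {}, {}
    for (x, y), p_joint in behaviour.items():
        for ref, setting, which in ((ref_a, x, 0), (ref_b, y, 1)):
            m = marginal(p_joint, which)
            first = ref.setdefault(setting, m)   # first marginal seen for this setting
            keys = set(first) | set(m)
            if any(abs(first.get(k, 0.0) - m.get(k, 0.0)) > tol for k in keys):
                return False
    return True

settings = [(x, y) for x in (0, 1) for y in (0, 1)]
# Assumed example 1: perfectly correlated or anticorrelated uniform bits.
correlated = {(x, y): {(a, b): 0.5 for a in (0, 1) for b in (0, 1)
                       if (a ^ b) == (x & y)} for x, y in settings}
# Assumed example 2: A simply copies the remote setting y.
signalling = {(x, y): {(y, 0): 1.0} for x, y in settings}
```

With these inputs, `no_disturbance(correlated)` holds because both marginals are uniform in every context, whereas `no_disturbance(signalling)` fails because the marginal of A changes with y.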
Note that no causal assumption is made up to this stage. We now define what we mean by a (classical) causal model for a phenomenon.
Definition 2 (Causal model). A causal model Γ for a phenomenon P consists of a (possibly empty) set of latent variables Λ, a DAG G with nodes {A, B, X, Y, Λ}, and a probability distribution P(ABXYΛ) compatible with G, such that the phenomenon is recovered as the marginal $P(ABXY) = \sum_\Lambda P(ABXY\Lambda)$.

A. Causal models for KS-noncontextuality violations require fine-tuning
Whereas no-disturbance and no-signalling are purely properties of phenomena, the definitions below are about properties of causal models for phenomena in contextuality scenarios. We may also say that a phenomenon violates a certain property when no causal model for the phenomenon satisfies that property.
A natural requirement within the causal models framework is the assumption of no-fine-tuning, or faithfulness. It has the flavour of Occam's razor: one should not postulate causal connections that are not apparent in the phenomena. If we can't signal faster than light, say, we should not a priori postulate faster-than-light causation. It can also be motivated by the principle that if a phenomenon has a certain symmetry, the causal model should respect that symmetry. No-disturbance, for example, can be understood as a symmetry of the phenomena.
With this motivation in mind, suppose that the phenomenon displays some conditional independence, say $(A \perp\!\!\!\perp Y \mid X)$, but that the actual causal structure in the world is such that the corresponding d-separation, $(A \perp\!\!\!\perp Y \mid X)_d$, doesn't hold. For example, there may actually be a causal link from Y to A. But if that's the case, why can't we observe them to be correlated? The only way this can be the case is if some of the parameters of the model (e.g. the distribution of the latent variables) take special values that wash out those correlations. This is precisely what happens, e.g. in Bohmian mechanics, where the hidden variables are constrained to satisfy the "quantum equilibrium" condition [23]. In other words, the parameters of Bohmian mechanics are finely-tuned to reproduce the quantum correlations while preserving no-signalling at the operational level. A faithful causal model, on the other hand, has no such hidden causal connections.
Definition 3 (Faithfulness (no fine-tuning)). A causal model Γ is said to satisfy no fine-tuning or be faithful relative to a phenomenon P iff every conditional independence $(C \perp\!\!\!\perp D \mid E)$ in P corresponds to a d-separation $(C \perp\!\!\!\perp D \mid E)_d$ in the causal graph G of Γ.
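A toy example may help to see what fine-tuning means operationally. The Python sketch below is a hypothetical illustration and is not a model of Bohmian mechanics or of any model in the paper: it takes a graph with a direct edge Y → A and a hidden bit L, sets A = Y XOR L, and lets L be 1 with probability q. The setting X is omitted for simplicity, and the function names are assumptions of the sketch. Although Y is a direct cause of A for every q, the observed distribution of A is independent of Y only at the special value q = 1/2.

```python
def p_a1_given_y(q):
    """P(A=1 | Y=y) for A = Y XOR L, with hidden bit L ~ Bernoulli(q).
    The graph has a direct edge Y -> A, so A and Y are not d-separated."""
    probs_l = {0: 1.0 - q, 1: q}
    return {y: sum(p for l, p in probs_l.items() if (y ^ l) == 1)
            for y in (0, 1)}

def shows_dependence(q, tol=1e-12):
    """True if the observed statistics of A reveal the causal link from Y."""
    table = p_a1_given_y(q)
    return abs(table[0] - table[1]) > tol
```

For a generic parameter such as q = 0.6 the link is visible in the statistics, while at q = 0.5 the model displays an independence of A and Y that has no counterpart as a d-separation in its graph, which is exactly what Definition 3 forbids.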
Note that although Bohmian mechanics violates no-fine-tuning, this does not by itself imply that this must be the case for all causal models for quantum theory. This situation is analogous to Bell's motivation for his 1964 theorem, when he wanted to decide whether all quantum hidden variable models must have the objectionable nonlocality of Bohmian mechanics. Here we show that all classical causal models for quantum theory must have the objectionable property of fine-tuning.
Mathematically, Bell-nonlocality and KS-contextuality amount to the non-existence of a factorisable hidden variable model, which in the language of causal models translates to:
Definition 4 (Factorisability). A causal model is said to satisfy factorisability iff, for all values of A, B, X, Y, $P(AB \mid XY) = \sum_\Lambda P(\Lambda)\,P(A \mid X\Lambda)\,P(B \mid Y\Lambda)$.
However, the motivation for this condition is quite distinct for each type of scenario. In the case of Bell scenarios, this condition is motivated, as we have seen in Sec. II A, by the conjunction of the assumptions of relativistic causal structure and Λ-independence applied to classical causal models, and we are justified in identifying Bell-locality with factorisability.
Definition 5 (Bell locality). A causal model for a Bell scenario is said to satisfy Bell-locality iff it is factorisable.
In general contextuality scenarios, the situation is more complicated. Let us recall that factorisability can also be derived from the assumptions of "parameter independence", "outcome independence" [24] and Λ-independence. Parameter independence is the requirement that $P(A \mid XY\Lambda) = P(A \mid X\Lambda)$ (and the analogous equation for B), that is, it is an assumption of "no-signalling at the ontological level". Outcome independence is the requirement that $P(A \mid BXY\Lambda) = P(A \mid XY\Lambda)$. However, while parameter independence is implied by (measurement) noncontextuality, that is not the case for outcome independence. Measurement noncontextuality, as defined in [3], states that the probability for a measurement outcome does not depend on the context in which it occurs, which implies parameter independence, but not outcome independence.
Indeed, both the original Kochen-Specker theorem and all subsequent derivations of Kochen-Specker-type noncontextuality inequalities require the extra assumption of "outcome determinism", i.e. that the measurement outcomes are determined by Λ for each context, $P(AB \mid XY\Lambda) \in \{0, 1\}$ (note that outcome determinism implies outcome independence). This means that violations of KS-noncontextuality inequalities cannot by themselves imply a failure of noncontextuality, since they do not rule out the possibility of a noncontextual indeterministic model. Furthermore, outcome determinism cannot be justified in the case of unsharp measurements [7], leading to a difficulty for experimental tests of contextuality.
Compare this situation with the Bell case. While it is true that Bell's 1964 theorem assumes outcome determinism, it has been later clarified by Bell and others that factorisability can be justified from local causality (as we have seen above) and so violations of Bell inequalities cannot be brushed off as merely implying indeterminism. This move is not available in the case of contextuality. With these caveats, I will follow the usual terminology, and define Kochen-Specker-noncontextuality as factorisability in a contextuality scenario.
Definition 6 (KS-noncontextuality). A causal model for a contextuality scenario is said to satisfy KS-noncontextuality iff it is factorisable.
From factorisability, one can derive, for each contextuality scenario, inequalities that bound the set of KS-noncontextual phenomena, as facets of a multidimensional polytope [25,26]. These are the KS-inequalities, which reduce to Bell inequalities in Bell scenarios.
Wood and Spekkens [14] showed that any causal model for a no-signalling Bell-scenario phenomenon that satisfies no-fine-tuning plus the assumptions of marginal setting independence and local setting dependence is factorisable, and thus satisfies all Bell inequalities. Local setting dependence is the assumption that the local outcome is not independent of the local setting, i.e. that it is not the case that $(A \perp\!\!\!\perp X)$ or $(B \perp\!\!\!\perp Y)$. This assumption rules out standard examples of Bell-inequality violation with unbiased outcomes. Marginal setting independence is the assumption that the settings X and Y are uncorrelated, $(X \perp\!\!\!\perp Y)$. This rules out applying their theorem to general contextuality scenarios: while in Bell scenarios the compatibility of X and Y is guaranteed by their being chosen from disjoint sets of measurements $M_A$ and $M_B$, in general contextuality scenarios X and Y are chosen from the same set M, and are thus not independent.
We are now ready to state our main result:
Theorem 1. No fine-tuning and no-disturbance imply KS-noncontextuality.
The proof is given in the Appendix. An immediate corollary is a stronger version of the result of [14], without the assumptions of marginal setting independence and local setting dependence: in Bell scenarios, no fine-tuning and no-signalling imply Bell-locality (Corollary 1). It is instructive to state Theorem 1 in a contrapositive form:
Corollary 2. Every causal model that reproduces the violation of a KS-inequality in a no-disturbance phenomenon requires fine-tuning.
Corollary 3. There exist quantum phenomena involving single systems that cannot be reproduced by any classical causal model without fine-tuning.
This result implies that the quantum violation of classical causality, long recognised in the case of space-like separated entangled quantum systems, also manifests in the case of single systems.

IV. DISCUSSION
In summary, we have derived KS-inequalities from no-disturbance and no-fine-tuning. This result unifies Bell nonlocality and KS-contextuality as violations of classical causality.
Although this work is inspired by [14], there are some important differences. There, the no-signalling condition is justified by space-like separation, whereas here the more general no-disturbance is implied by measurement compatibility. More importantly, two extra assumptions are made in [14] but not here: i) marginal setting independence, that the measurement choices X and Y are independent; and ii) local setting dependence, that the outcome of a measurement is dependent on its respective choice. The first excludes general contextuality scenarios, and the second rules out some textbook examples of Bell violations. Thus this work extends [14] already in Bell scenarios. Remarkably, our result needs no assumption related to independence of settings, such as free choice, Λ-independence, or marginal setting independence, unlike all other derivations of Bell inequalities.
One could object that to achieve the no-disturbance conditions in general contextuality scenarios, one requires perfectly compatible measurements, and that this idealisation makes it inapplicable to real experimental tests. This is true, but it is no worse than the problem faced by all standard derivations of KS-inequalities, as discussed in Sec. III A, and in [7], since the assumption of outcome determinism is incompatible with unsharp measurements. The advantage of the present work is that, at least for idealised phenomena, it allows the derivation of KS-inequalities from causality principles alone, without the extra assumption of outcome determinism. Thus it allows for the conclusion that those causal principles must be revised in light of the (idealised) predictions of quantum theory, whereas no such conclusion can be reached with the usual derivation, even in the idealised case, because mere indeterminism is always an option.
Furthermore, the present derivation suggests a path for an experimentally robust generalisation. Given a measure of causal connection [27,28], we can propose a generalised principle of no-fine-tuning: a causal model should not allow causal connections stronger than needed to explain the observed deviations from no-disturbance. It would be interesting to determine whether testable constraints can be derived this way. Another interesting question is whether the proof in this paper can be extended to arbitrary numbers of measurements per context.
From the point of view of applications, we may understand fine-tuning as a kind of "resource waste", postulating causal links that are not directly observed in the phenomena, but are washed-out by our ignorance of underlying parameters. It would be plausible to conjecture that quantum causal models [15][16][17][18][19], on the other hand, can avoid fine-tuning in explaining contextual correlations. Such a result could provide a potential explanatory basis for the power of contextual correlations: since classical simulations of quantum correlations are essentially classical causal models, they must necessarily waste resources via fine-tuning.
Finally, it would be interesting to determine whether the generalised notions of noncontextuality given by the formalism of Spekkens [3] can also be understood as arising from no-fine-tuning.
APPENDIX
Proof of Theorem 1. We prove by exhaustion that all DAGs that do not require fine-tuning to explain the no-disturbance conditions lead to factorisability, and thereby to KS-noncontextuality.
First note that the no-disturbance conditions, together with the assumption of no fine-tuning, imply that every DAG G for a compatibility phenomenon P must satisfy the d-separation conditions $(A \perp\!\!\!\perp Y \mid X)_d$ and $(B \perp\!\!\!\perp X \mid Y)_d$. We thus proceed by excluding every DAG that does not satisfy these conditions, and showing that all remaining DAGs imply factorisability.
The class of DAGs we need to consider are those that include latent variables as common causes for observable variables, or direct causal connections between variables. There is no point considering latent variables as intermediaries between variables, or as common effects of variables, since adding those has no effect on the allowed probability distributions over the observable variables.
To aid the proofs, we introduce the graphical notation in Figs. 3-6 to represent sets of causal connections.
Step 1: From the d-separation condition $(A \perp\!\!\!\perp Y \mid X)_d$, we can exclude any direct causal link or common cause between A and Y (i.e. all edges of the kind shown in Fig. 6). Likewise, from $(B \perp\!\!\!\perp X \mid Y)_d$ we can exclude any direct causal link or common cause between B and X. Taken together, these exclude common causes between any three or all four of the variables. We are left with the remaining class of graphs.
Step 2a: Next, we exclude a direct causal link from A to B (with or without a common cause between those two variables). First, note that the assumption of such a link excludes any causal link between A and X, as those would violate $(B \perp\!\!\!\perp X \mid Y)_d$, and a direct link from B to Y, as this would violate $(A \perp\!\!\!\perp Y \mid X)_d$. The remaining class of graphs compatible with a direct link from A to B can now have any connection between X and Y plus any link with no direct cause from B to Y (Fig. 8).
Step 2b: We now exclude a common cause between B and Y acting together with a direct link from X to Y and/or a common cause between X and Y; those graphs would violate $(B \perp\!\!\!\perp X \mid Y)_d$, since Y is then a collider on a path between X and B, which conditioning on Y unblocks. There are now two classes of graphs compatible with a direct link between A and B: i) any link between X and Y and a direct link from Y to B; or ii) a direct link from Y to X plus any link with no direct cause from B to Y (Fig. 9).
Step 2c: We proceed to show that all phenomena compatible with the two classes of graphs remaining after Step 2b are factorisable. To see this, first note that both classes of graphs i) and ii) above respect the following d-separation conditions: $(AB \perp\!\!\!\perp X \mid Y)_d$ and $(A \perp\!\!\!\perp Y)_d$. This means that all distributions compatible with those graphs must respect $P(AB \mid XY) = P(AB \mid Y)$ and $P(A \mid Y) = P(A)$.
The first conditional independence implies that the joint distribution of A and B doesn't depend on the choice of measurement X, which intuitively should imply that this phenomenon cannot be contextual. To see this formally, note that from the definition of conditional probability and the two equations above we get $P(AB \mid XY) = P(B \mid AY)\,P(A \mid Y) = P(B \mid AY)\,P(A)$. Now let Λ be a variable that determines A, so that $P(A) = \sum_\Lambda P(\Lambda)\,P(A \mid \Lambda)$ and $P(B \mid AY) = P(B \mid Y\Lambda)$. Then $P(AB \mid XY) = \sum_\Lambda P(\Lambda)\,P(A \mid \Lambda)\,P(B \mid Y\Lambda)$, which is a factorisable model with no dependence on X.
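The construction in Step 2c can be checked numerically. The sketch below is a hypothetical illustration in Python: it takes Λ to be a copy of A (the simplest variable that determines A), builds the factorised sum over Λ, and compares it with P(B|AY)P(A). The function name and the numerical tables `p_a` and `p_b_given_ay` are arbitrary values invented for this example.

```python
def factorised_model(p_a, p_b_given_ay, a, b, y):
    """Sum over L of P(L) P(A=a|L) P(B=b|Y=y,L), taking L to be a copy of A."""
    total = 0.0
    for lam, p_lam in p_a.items():
        p_a_given_lam = 1.0 if a == lam else 0.0   # L determines A
        total += p_lam * p_a_given_lam * p_b_given_ay[lam, y][b]
    return total

# Arbitrary illustrative numbers (not from the paper).
p_a = {0: 0.3, 1: 0.7}
p_b_given_ay = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.4, 1: 0.6},
}
```

For every choice of a, b and y the factorised sum reproduces P(B=b|A=a,Y=y)P(A=a), so a direct influence from A to B in these graphs can always be re-expressed through a latent common cause.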
This concludes the part of the proof excluding a direct causal link from A to B. By symmetry we exclude any direct causal link from B to A. The remaining class of graphs now can have a common cause between A and B and any causal link between the pairs {X, A}, {Y, B} and {X, Y} (Fig. 10).
Step 3: We now proceed to exclude, from the remaining graphs, any direct cause from A to X (a retrocausal model). First we see that, assuming such a link, a common cause between A and B is excluded by $(B \perp\!\!\!\perp X \mid Y)_d$. Next, any link between X and Y except X → Y is excluded by $(A \perp\!\!\!\perp Y \mid X)_d$. Finally, any link between Y and B except Y → B is excluded by $(B \perp\!\!\!\perp X \mid Y)_d$.

The remaining graph has four possible links: a common cause between A and X, A → X, X → Y and Y → B. This implies that $(B \perp\!\!\!\perp AX \mid Y)$ and $(A \perp\!\!\!\perp Y \mid X)$. Thus $P(AB \mid XY) = P(B \mid AXY)\,P(A \mid XY) = P(A \mid X)\,P(B \mid Y)$, which is trivially factorisable. By symmetry, we eliminate any direct cause from B to Y.
Fig. 11: Elimination of DAGs in step 3.
Step 4: After step 3 we are left with the following class of graphs (Fig. 12): a possible common cause (let's call it Λ) between A and B, any link between X and Y , no direct cause A → X and no direct cause B → Y .

In the next step we exclude, from $(A \perp\!\!\!\perp Y \mid X)_d$, any common cause A ↔ X acting together with X ↔ Y and/or X ← Y. Likewise, from $(B \perp\!\!\!\perp X \mid Y)_d$, we exclude any common cause B ↔ Y acting together with X ↔ Y and/or X → Y.
Step 5: We are finally left with three remaining classes of graphs. All three classes allow for A ↔ B, X → A and Y → B, and respectively i) X ↔ A, X → Y; ii) Y ↔ B, X ← Y; and iii) any link between X and Y. All of these have Λ as a free variable and imply the conditional independences $(A \perp\!\!\!\perp BY \mid X\Lambda)$ and $(B \perp\!\!\!\perp AX \mid Y\Lambda)$. So
$$P(AB \mid XY) = \sum_\Lambda P(\Lambda)\,P(A \mid BXY\Lambda)\,P(B \mid XY\Lambda) = \sum_\Lambda P(\Lambda)\,P(A \mid X\Lambda)\,P(B \mid Y\Lambda),$$
which is of factorisable form. This completes the proof.