Experimentally bounding deviations from quantum theory in the landscape of generalized probabilistic theories

Many experiments in the field of quantum foundations seek to adjudicate between quantum theory and speculative alternatives to it. This requires one to analyze the experimental data in a manner that does not presume the correctness of the quantum formalism. The mathematical framework of generalized probabilistic theories (GPTs) provides a means of doing so. We present a scheme for determining which GPTs are consistent with a given set of experimental data. It proceeds by performing tomography on the preparations and measurements in a self-consistent manner, i.e., without presuming a prior characterization of either. We illustrate the scheme by analyzing experimental data for a large set of preparations and measurements on the polarization degree of freedom of a single photon. We find that the smallest and largest GPT state spaces consistent with our data are a pair of polytopes, each approximating the shape of the Bloch Sphere and having a volume ratio of $0.977 \pm 0.001$, which provides a quantitative bound on the scope for deviations from quantum theory. We also demonstrate how our scheme can be used to bound the extent to which nature might be more nonlocal than quantum theory predicts, as well as the extent to which it might be more or less contextual. Specifically, we find that the maximal violation of the CHSH inequality can be at most $(1.3 \pm 0.1)\%$ greater than the quantum prediction, and the maximal violation of a particular inequality for universal noncontextuality cannot differ from the quantum prediction by more than this factor on either side. The most significant loophole in this sort of analysis is that the set of preparations and measurements one implements might fail to be tomographically complete for the system of interest.

Despite the empirical successes of quantum theory, it may one day be supplanted by a novel, post-quantum theory. Many researchers have sought to anticipate what such a theory might look like based on theoretical considerations, in particular, by exploring how various natural physical principles narrow down the scope of possibilities in the landscape of all physical theories (see [1] and references therein). In this article, we consider a complementary problem: how to narrow down the scope of possibilities directly from experimental data.
Most experiments in the field of quantum foundations aim to adjudicate between quantum theory and some speculative alternative to it. They seek to constrain (and perhaps uncover) deviations from the quantum predictions. Although a few proposed alternatives to quantum theory can be articulated within the quantum formalism itself, such as models which posit intrinsic decoherence [2-5], most are more radical. Examples include Almost Quantum Theory [6,7], theories with higher-order interference [9-14] (or of higher order in the sense of Ref. [8]), and modifications to quantum theory involving the quaternions [15-18].
In order to assess whether experimental data provides any evidence for a given proposal (and against quantum theory), it is clearly critical that one not presume the correctness of quantum theory in the analysis. Therefore it is inappropriate to use the quantum formalism to model the experiment. A more general formalism is required. Furthermore, it would be useful if rather than implementing dedicated experiments for each proposed alternative to quantum theory, one had a technique for directly determining the experimentally viable regions in the landscape of all possible physical theories. The framework of generalized probabilistic theories (GPTs) provides the means to meet both of these challenges.
This framework adopts an operational approach to describing the content of a physical theory. It has been developed over the past fifteen years in the field of quantum foundations (see [8, 19-21], as well as [7, 22-29]), continuing a long tradition of such approaches [30-33]. It is operational because it takes the content of a physical theory to be merely what it predicts for the probabilities of outcomes of measurements in an experiment.
The GPT framework makes only very weak assumptions, which are arguably unavoidable if an operationalist's conception of an experiment is to be meaningful. One is that experiments have a modular form, such that one part of an experiment can be varied independently of another (preparations and measurements, for instance); another is that it is possible to repeat a given experimental configuration in such a way that it constitutes an i.i.d. source of statistical data. Beyond this, however, it is completely general. It has been used extensively to provide a common language for describing and comparing abstract quantum theory, classical probability theory, and many foils to these, including quantum theory over the real or quaternionic fields [18], theories with higher-order interference [34-36], and the generalized no-signalling theory (also known as Boxworld) [19,26].
Using this framework, we propose a technique for analyzing experimental data that allows researchers to overcome their implicit quantum bias (the tendency to view all experiments through the lens of quantum concepts and the quantum formalism) and take a theory-neutral perspective on the data.
Despite the fact that the GPT formalism is ideally suited to the task, to our knowledge, it has not previously been applied to the analysis of experimental data (with the exception of Ref. [37], which applied it to an experimental test of universal noncontextuality and which inspired the present work).
In this paper, we aim to answer the question: given specific experimental data, how does one find the set of GPTs that could have generated the data? We call this the "GPT inference problem". Solving the problem requires implementing the GPT analogue of quantum tomography. Quantum tomography experiments that have sought to characterize unknown states have typically presumed that the measurements are already well-characterized [38-44], and those that have sought to characterize unknown measurements have typically presumed that the states are known [45,46]. If one has no prior knowledge of either the states or the measurements, then one requires a tomography scheme that can characterize them both based on their interplay. We call such a tomographic scheme self-consistent. To solve the GPT inference problem, we introduce such a self-consistent tomography scheme within the framework of GPTs.
We also illustrate the use of our technique with an experiment on the polarization degree of freedom of a single photon. For each of a large number of preparations, we perform a large number of measurements, and we analyze the data using our self-consistent tomography scheme to infer a GPT characterization of both the preparations and the measurements.
To clarify what, precisely, our analysis implies, we begin by distinguishing two ways in which nature might deviate from the predictions of quantum theory within the framework of GPTs. The first possibility is that it exhibits a deviation (relative to what quantum theory predicts for the system of interest) in the particular shapes of the spaces of GPT state vectors and GPT effect vectors but no deviation in the dimensionality of the GPT vector space. The second possibility is that it deviates from quantum expectations even in the dimensionality.
From our experimental data, we find no evidence of either sort of deviation. If nature does exhibit deviations and these are of the first type (i.e., deviations to shapes but not to dimensions), then we are able to put quantitative bounds on the degree of such deviations. If nature exhibits deviations of the second type (dimensional deviations), then although our GPT inference technique may fail to detect them in a given experiment, it does provide an opportunity for doing so. In the next few paragraphs, we try to explain the precise sense in which there is such an opportunity.
If dimensional deviations from quantum theory happen to only be significant for some exotic new types of preparations and measurements, then insofar as our experiment only probes a photon's polarization in conventional ways (using waveplates and beamsplitters), there is nothing in its design ensuring that such deviations will be found. Nonetheless, it is still the case that our experiment (and any other that implements our technique on data obtained by probing a system in conventional ways) has an opportunity to discover such deviations, even in the absence of any knowledge of the type of exotic procedures required to make such deviations significant. To see why this is the case, note that there are two ways in which an experiment might discover new physics: the "terra nova" strategy, wherein one's experiment probes a new phenomenon or regime of some physical quantity, and the "precision" strategy, wherein one's experiment achieves increased precision for a previously explored phenomenon or regime.
To illustrate the distinction, consider a counterfactual history of physics, wherein the special theory of relativity was not discovered by theoretical considerations but was instead inferred primarily from experimental discoveries. Imagine, for instance, that it began with the discovery of corrections to the established (nonrelativistic) formulas for properties of moving bodies, such as the expression for their kinetic energy or the Doppler shift of the radiation they emit. On the one hand, an experimenter who, for whatever reason, had found herself investigating the behaviour of systems accelerated to speeds that were a significant fraction of the speed of light (without necessarily even knowing that the speed of light was a limit) would have found significant deviations from various nonrelativistic formulas. On the other hand, an experimenter who probed systems at unexceptional speeds (i.e., speeds small compared to the speed of light), but with a degree of precision much higher than had previously been achieved, could still have discovered the inadequacy of nonrelativistic formulas by detecting small but statistically significant deviations from them.
The experiment we report provides an opportunity to discover a deviation (from quantum theory) in the dimension of the GPT vector space required to describe photon polarization because it provides a precision characterization of a large set of preparations and measurements thereon. If experimental set-ups designed to realize conventional preparations and measurements inadvertently extend some small distance into the space of exotic preparations and measurements, say, by fluctuations or small systematic effects, then our technique can reveal this fact by showing that the expected dimensionality for the GPT vector space does not fit the data. The full scope of possible preparations and measurements for photon polarization might be radically different from what our quantum expectations dictate (incorporating new exotic procedures), and yet one could, by serendipity, experimentally realize a set of preparations and measurements that are tomographically complete for this full set rather than being merely sufficient for characterizing the conventional procedures. In other words, the realized set could manage to span the full post-quantum GPT vector space in spite of its not having been designed to do so. In Section III A, we provide a more detailed discussion of this point.

Applying our GPT inference technique to our experimental data, we find that our experiment is best represented by a GPT of dimension 4, which is what quantum theory predicts to be the appropriate dimension for photon polarization. In other words, we find no evidence for a deviation in the dimension of the GPT vector space, relative to quantum expectations, at the precision frontier using conventional means of probing photon polarization.
We can therefore conclude that one of the following possibilities must hold: (i) there are no dimensional deviations, (ii) there are dimensional deviations, which exotic preparations and measurements would reveal, but the procedures realized in our experiment contain strictly no exotic component, (iii) there are dimensional deviations, which exotic preparations and measurements would reveal, and the procedures realized in our experiment do contain some exotic component, but the latter is not visible at the level of precision achieved in our experiment.
We now describe what further conclusions we can draw from our experiment supposing that the realized preparations and measurements in our experiment are tomographically complete, that is, supposing that they have nontrivial components in all dimensions of the GPT vector space describing photon polarization and that these components are visible at the level of precision achieved in our experiment. In other words, we now describe what further conclusions we can draw from our experiment if we suppose that it is possibility (i), rather than possibilities (ii) or (iii), that holds. In this case, we are able to place bounds (at the 1% level) on how much the state and effect spaces of the true GPT might deviate from those predicted by quantum theory. In addition, we are able to draw explicit quantitative conclusions about three types of such putative deviations, which we now outline.
The no-restriction hypothesis [20] asserts that if some measurement is logically possible (i.e., it gives positive probabilities for all states in the theory) then it should be physically realizable. It is true of quantum theory; indeed, it is a popular axiom in many axiomatic reconstructions thereof. A failure of the no-restriction hypothesis, therefore, constitutes a departure from quantum theory. We put quantitative bounds on the possible degree of this failure, that is, on the potential gap between the set of measurements that are physically realizable and those that are logically possible. Recalling the scope of possible conclusions (i)-(iii) above, the only way for any future experiment to overturn this conclusion about deviations from the no-restriction hypothesis is if it demonstrated the need for dimensional deviations.
We can also put an upper bound on the amount by which nature might violate Bell inequalities in excess of the amount predicted by quantum theory. Specifically, for the CHSH inequality [47], we show that, for photon polarization, any greater-than-quantum degree of violation is no more than (1.3 ± 0.1)% higher than the quantum bound. To our knowledge, this is the first proposal for how to obtain an experimental upper bound on the degree of Bell inequality violation in nature. The only possibility for a future experiment on photon polarization to violate the quantum bound by more than (1.3 ± 0.1)% is if it demonstrated the need for dimensional deviations.
In a similar vein, we consider noncontextuality inequalities. These are akin to Bell inequalities, but test the hypothesis of universal noncontextuality [48] rather than local causality. Here, our technique provides both an upper and a lower bound on the degree of violation. For a particular noncontextuality inequality, described in Ref. [49], we find that the true value of the violation is no more than (1.3 ± 0.1)% higher and no less than (1.3 ± 0.1)% lower than the quantum bound. As with Bell inequalities, the only way for any future experiment on photon polarization to find a violation outside this range is if it demonstrated the need for dimensional deviations.
Although we have not here sought to implement any terra nova strategy for finding deviations from quantum theory, any future experiment that aims to do so can make use of our GPT inference technique to analyze the data and evaluate the evidence. Inasmuch as terra nova strategies, relative to precision strategies, provide a complementary (and presumably better) opportunity for finding new physics, our GPT inference technique is also significant insofar as it provides the means to analyze such experiments.

II. THE FRAMEWORK OF GENERALIZED PROBABILISTIC THEORIES
A. Basics

For any system, in any physical theory, there will in general be many possible ways for it to be prepared, transformed, and measured. Here, each preparation procedure, transformation procedure, and measurement procedure is conceived as a list of instructions for what to do in the laboratory. The different combinations of possibilities for each procedure define a collection of possible experimental configurations. We will here restrict our attention to experimental configurations of the prepare-and-measure variety: these are the configurations where there is no transformation intervening between the preparation and the measurement and where the measurement is terminal (which is to say that the system does not persist after the measurement). We further restrict our attention to binary-outcome measurements.
A GPT aims to describe only the operational phenomenology of a given experiment. In the case of a prepare-and-measure experiment, it aims to describe only the relative probabilities of the different outcomes of each possible measurement procedure when it is implemented following each possible preparation procedure. For binary-outcome measurements, it suffices to specify the probability of one of the outcomes, since the other is determined by normalization. If we denote the outcome set by {0, 1}, then it suffices to specify the probability of the event of obtaining outcome 0 in measurement M. This event will be termed an effect and denoted [0|M].
Thus a GPT specifies a probability p(0|P, M) for each preparation P and measurement M. Denoting the cardinality of the set of all preparations (respectively all measurements) by m (respectively n), the set of these probabilities can be organized into an m × n matrix, denoted D, whose rows correspond to distinct preparations and whose columns correspond to distinct effects, so that D_ij = p(0|P_i, M_j). We refer to D as the probability matrix associated to the physical theory. Because it specifies the probabilities for all possibilities for the preparations and the measurements, it contains all of the information about the putative physical theory for prepare-and-measure experiments. Defining k ≡ rank(D), one can factor D into a product of two rectangular matrices,

D = SE, (1)

where S is an (m × k) matrix and E is a (k × n) matrix.
Denoting the i-th row of S by the row vector s_{P_i}^T (where T denotes transpose) and the j-th column of E by the column vector e_{[0|M_j]}, we can write

D_ij = s_{P_i}^T e_{[0|M_j]},

so that

p(0|P_i, M_j) = s_{P_i} · e_{[0|M_j]}.

Factoring D in this way allows us to associate to each preparation P a k-dimensional vector s_P and to each effect [0|M] a k-dimensional vector e_{[0|M]} such that the probability of obtaining the effect [0|M] on the preparation P is recovered as their inner product, p(0|P, M) = s_P · e_{[0|M]}. The vectors s_P and e_{[0|M]} will be termed GPT state vectors and GPT effect vectors, respectively. A particular GPT is specified by the sets of all allowed GPT state and effect vectors, denoted by S and E, respectively. Because the n GPT effect vectors associated to the set of all measurement effects lie in a k-dimensional vector space, only k of them are linearly independent. Any set of k measurement effects whose associated GPT effect vectors form a basis for the space will be termed a tomographically complete set of measurement effects. The terminology stems from the fact that if one seeks to deduce the GPT state vector of an unknown preparation from the probabilities it assigns to a set of characterized measurement effects (the GPT analogue of quantum state tomography), then this set of GPT effect vectors must form a basis of the k-dimensional space. Similarly, any set of k preparations whose associated GPT state vectors form a basis for the space will be termed tomographically complete, because to deduce the GPT effect vector of an unknown measurement effect from the probabilities assigned to it by a set of known preparations, the GPT state vectors associated to the latter must form a basis.
For any GPT, we necessarily have that the rank of D satisfies k ≤ min{m, n}, but in general, we expect k to be much smaller than m or n.
(If the sets of preparations and measurements form continua, indexed by continuous parameters x and y, then the complete information about the physical theory is instead given by the function f(x, y) := p(0|P_x, M_y). The GPT is a theoretical abstraction, so it is acceptable if it is presumed to contain such continua.)
There is a freedom in the decomposition of Eq. (1). Specifically, for any invertible (k × k) matrix R, we have D = SE = (SR^{-1})(RE). Thus, there are many decompositions of D of the type described. Note that any basis of the k-dimensional vector space remains a basis under an invertible linear transformation, so the property of being tomographically complete is independent of the choice of representation.
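To make the factorization and its gauge freedom concrete, the following is a minimal numerical sketch. The probability matrix here is synthetic data of our own construction (not the paper's experimental data): it is built as a product of random state and effect matrices, factored afresh via the singular value decomposition, and then checked against the freedom D = (SR⁻¹)(RE).

```python
# Sketch (hypothetical data): factor a probability matrix D into GPT
# state and effect matrices, D = S E, using a rank-revealing SVD.
import numpy as np

rng = np.random.default_rng(0)

# Build a synthetic rank-4 probability matrix the way a k = 4 GPT would:
# random states times random effects (entries lie in [0, 1] by construction).
S_true = np.hstack([np.ones((20, 1)), rng.uniform(-0.4, 0.4, (20, 3))])
E_true = np.vstack([np.full((1, 30), 0.5), rng.uniform(-0.25, 0.25, (3, 30))])
D = S_true @ E_true

# Factor D = S E with k = rank(D), via the SVD D = U diag(w) V^T.
U, w, Vt = np.linalg.svd(D, full_matrices=False)
k = int(np.sum(w > 1e-10 * w[0]))        # numerical rank
S = U[:, :k] * w[:k]                     # (m x k) GPT state matrix
E = Vt[:k, :]                            # (k x n) GPT effect matrix
assert np.allclose(S @ E, D)

# The decomposition is only unique up to an invertible R: D = (S R^-1)(R E).
R = rng.normal(size=(k, k)) + 4 * np.eye(k)   # generically invertible
assert np.allclose((S @ np.linalg.inv(R)) @ (R @ E), D)
```

Note that the SVD is only one way to produce such a factorization; any rank-revealing decomposition yields an equally valid (and linearly related) GPT representation.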
It is worth noting that for any physical theory, the GPT framework provides a complete description of its operational predictions for prepare-and-measure experiments. In this sense, the GPT framework is completely general. Furthermore, one can show that under a very weak assumption it provides the most efficient description of the theory, in the sense that it is a description with the smallest number of parameters. The weak assumption is that it is possible to implement arbitrary convex mixtures of preparations without altering the functioning of each preparation in the mixture, so that for any set of GPT state vectors that are admitted in the theory, all of the vectors in their convex hull are also admitted in the theory. See Theorem 1 of Ref. [23] for the proof.
We will here make this weak assumption and restrict our attention to GPTs wherein any convex mixture of preparation procedures is another valid preparation procedure, so that the set of GPT state vectors is convex [8]. We refer to the set S of GPT states in a theory as its GPT state space. We also make the weak assumption that any convex mixture of measurements and any classical post-processing of a measurement is another valid measurement. This implies that the set of GPT effect vectors consists of the intersection of two cones, which can be described as follows: there is some set of ray-extremal GPT effect vectors, such that the first cone is the convex hull of all positive multiples of these vectors, and the second cone is the set of vectors which can be summed with a vector in the first cone to yield the unit effect vector u (defined below). (This ensures that if a given effect e is in the GPT, then so is the complementary effect ē := u − e.) We will use the term "diamond" to describe this sort of intersection of two cones, and we refer to the set E of GPT effects in a theory as its GPT effect space.
It is worth noting that although GPTs which fail to be closed under convex mixtures and classical post-processing are of theoretical interest (there are interesting foils to quantum theory of this type [48,51]), one does not expect them to be candidates for the true GPT describing nature, because there seems to be no obstacle in practice to mixing or post-processing procedures in an arbitrary way. To put it another way, the evidence suggests that the GPT describing nature must include classical probability theory as a subtheory, thereby providing the resources for implementing arbitrary mixtures and post-processings.
Distinct physical theories (i.e., distinct GPTs) are distinguished by the shapes of the GPT state space and the GPT effect space, where these shapes are defined up to a linear transformation, as described earlier.
We end by highlighting some conventions we adopt in representing GPTs. Define the unit measurement effect as the one which occurs with probability 1 for all preparations (it is represented by a column of 1s in D), and denote it by u. Because each s_P will have an inner product of 1 with u (by normalization of probability), it follows that there are only k − 1 free parameters in the GPT state vector. We make a conventional choice (i.e., a particular choice within the freedom of linear transformations) to represent the unit effect by the GPT effect vector (1, 0, 0, …)^T. This choice forces the first component of all of the GPT state vectors to be 1. In this case, one can restrict the search for factorizations D = SE to those for which the first column of S is a column of 1s. It also follows that the projection of all GPT state vectors along one of the axes of the k-dimensional vector space has value 1, and consequently it is useful to depict only the projection of the GPT state vectors into the complementary (k − 1)-dimensional subspace.
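This gauge-fixing convention can be realized concretely: given any factorization in which some column of E equals the unit effect u, choosing an R that maps u to (1, 0, …, 0)^T forces the first column of S to be all 1s. The sketch below uses small synthetic matrices of our own construction (a hypothetical k = 3 model, not the paper's data).

```python
# Sketch (hypothetical data): change gauge so the unit effect becomes
# (1, 0, ..., 0)^T and every GPT state vector has first entry 1.
import numpy as np

rng = np.random.default_rng(1)

# Start from a factorization D = S E of a toy k = 3 model in which the
# first effect column of E is the unit effect u (so S @ u is all ones).
S = np.hstack([np.full((8, 1), 2.0), rng.uniform(-0.3, 0.3, (8, 2))])
u = np.array([0.5, 0.0, 0.0])            # S @ u == 1 for every row
E = np.column_stack([u, rng.uniform(-0.2, 0.2, (3, 5))])

# Choose any invertible B whose first column is u; then R = B^-1 maps u
# to the first standard basis vector, and (S R^-1, R E) is the new gauge.
B = np.column_stack([u, np.eye(3)[:, 1:]])
R = np.linalg.inv(B)
S_new, E_new = S @ B, R @ E              # S @ B == S @ inv(R)
assert np.allclose(E_new[:, 0], [1, 0, 0])   # unit effect in standard form
assert np.allclose(S_new[:, 0], 1)           # first column of S is all 1s
assert np.allclose(S_new @ E_new, S @ E)     # probabilities unchanged
```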

B. Examples
Some simple examples serve to clarify the notion of a GPT. First, consider a 2-level quantum system (qubit). The set of all preparations is represented by the set of all positive trace-one operators on a 2-dimensional complex Hilbert space, that is, ρ ∈ L(C²) with L denoting the linear operators, such that ρ ≥ 0 and Tr(ρ) = 1. Each measurement effect is associated with a positive operator bounded above by the identity, 0 ≤ Q ≤ I. Each measurement effect and each preparation can also be represented by a vector in a real 4-dimensional vector space, by simply decomposing the operators representing them relative to any orthonormal basis of Hermitian operators. The Born rule is reproduced by the vector space inner product because it is simply the Hilbert-Schmidt inner product of the associated operators.
The most common example of such a representation is the one that uses (a scalar multiple of) the four Pauli operators, {½I, ½σ_x, ½σ_y, ½σ_z}, as the orthonormal basis of the space of operators. A preparation represented by a density operator ρ is associated with the 4-dimensional real vector s ≡ (s_0, s_1, s_2, s_3) via the relation ρ = ½ s · σ, where σ ≡ (I, σ_x, σ_y, σ_z), or equivalently, ρ = ½(s_0 I + s_1 σ_x + s_2 σ_y + s_3 σ_z). The condition Tr(ρ) = 1 implies that s_0 = 1, and the conditions Tr(ρ) = 1 and ρ ≥ 0 together imply that s_1² + s_2² + s_3² ≤ 1. Consequently, there is only a 3-dimensional freedom in specifying a quantum state. Geometrically, the possible s describe a ball of radius 1, conventionally termed the Bloch Sphere (strictly speaking, it should be called the Bloch Ball) and depicted in Fig. 1(a)(i). A measurement effect represented by an operator Q is associated with the 4-dimensional real vector e ≡ (e_0, e_1, e_2, e_3) via the relation Q = e · σ. (Note that this relation differs from the standard convention used in quantum information theory by a factor of 2.) The conditions Q ≥ 0 and Q ≤ I imply that 0 ≤ e_0 ≤ 1, e_1² + e_2² + e_3² ≤ e_0², and e_1² + e_2² + e_3² ≤ (1 − e_0)², which constrains e to lie within the intersection of two four-dimensional cones, which we refer to as the Bloch Diamond and depict via a pair of three-dimensional projections in Fig. 1(a)(ii)-(iii).
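As a check on this representation, the following sketch (with Bloch vectors chosen by us purely for illustration) verifies that the Born rule Tr(ρQ) reduces to the real inner product s · e under the conventions above.

```python
# Sketch: verify that the Pauli (Bloch) representation reproduces the
# Born rule Tr(rho Q) as a real inner product s . e (example data ours).
import numpy as np

I = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
sigma = [I, sx, sy, sz]

def rho_from(s):                 # rho = (1/2) s . sigma
    return 0.5 * sum(si * P for si, P in zip(s, sigma))

def Q_from(e):                   # Q = e . sigma (the convention above)
    return sum(ei * P for ei, P in zip(e, sigma))

s = np.array([1.0, 0.3, -0.2, 0.5])      # valid: s1^2 + s2^2 + s3^2 <= 1
e = np.array([0.5, 0.1, 0.2, -0.1])      # valid: satisfies both cone bounds
rho, Q = rho_from(s), Q_from(e)

born = np.trace(rho @ Q).real
assert np.isclose(born, s @ e)           # Born rule = inner product
```

The identity behind the assertion is Tr(σ_i σ_j) = 2δ_ij (with σ_0 = I), which cancels the factor of ½ in ρ.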
As noted in the discussion of the GPT framework, this geometric representation of the quantum state and effect spaces is only one possibility among many. If we define a linear transformation of the state space by any invertible 4 × 4 matrix and we take the corresponding inverse linear transformation on the effect space, the new state and effect spaces will also provide an adequate representation of all prepare-and-measure experiments on a single qubit. (Note that implementing a linear transformation of this form is equivalent to representing quantum states and effects with respect to a different basis of Hermitian operators.)

Classical probabilistic theories can also be formulated within the GPT framework. Consider the simplest case of a classical system with two possible physical states, i.e., a classical bit, for which k = 2. The set of possible preparations of this system is simply the set of normalized probability distributions on a bit, µ = (µ_0, µ_1), where 0 ≤ µ_0, µ_1 ≤ 1 and µ_0 + µ_1 = 1. The most general measurement effect is a pair of probabilities, specifying the probability of that effect occurring for each value of the bit, that is, ξ = (ξ_0, ξ_1), where 0 ≤ ξ_0, ξ_1 ≤ 1. The probability of a particular measurement effect occurring when implemented on a particular preparation is simply the inner product of these, µ · ξ. The positivity and normalization constraints imply that the convex set of state vectors describes a line segment from (1, 0) to (0, 1), and the set of effect vectors is the square region with vertices (0, 0), (1, 0), (0, 1), and (1, 1).
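A minimal numerical check of the classical-bit representation, and of the representational freedom noted above (the numbers are toy values of our own choosing):

```python
# Sketch: the classical bit as a GPT, with toy numbers of our own choosing.
import numpy as np

# Standard representation: state mu, effect xi, probability mu . xi.
mu = np.array([0.7, 0.3])        # mu0 + mu1 = 1, entries in [0, 1]
xi = np.array([0.9, 0.2])        # each entry in [0, 1]
p = mu @ xi                      # probability of this effect on this state
assert np.isclose(p, 0.69)

# Representational freedom: transform states by inv(R)^T and effects by R.
# Probabilities are invariant because s'.e' = s^T R^-1 R e = s.e.
R = np.array([[1.0, 1.0], [1.0, -1.0]])      # any invertible matrix works
mu2 = np.linalg.inv(R).T @ mu
xi2 = R @ xi
assert np.isclose(mu2 @ xi2, p)
```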
For ease of comparison with our examples of GPTs, it is useful to consider a linear transformation of this representation, corresponding geometrically to a rotation by 45 degrees. We represent each preparation by a state vector s = (1, s_1), where −1 ≤ s_1 ≤ 1, and each measurement effect by an effect vector e = (e_0, e_1), where −1/2 ≤ e_1 ≤ 1/2, e_0 ≥ |e_1|, and e_0 ≤ 1 − |e_1| (with the experimental probabilities still given by their inner product, s · e). The convex set of these state vectors can then be depicted as a horizontal line segment, and the set of effect vectors by a diamond with a line segment at its base, as in Fig. 1(b). This representation makes it clear that the state and effect spaces of a classical bit are contained within those of a qubit (as the quantum states and effects whose representations as operators are diagonal in some fixed basis of the Hilbert space).

One can also consider GPTs that are neither classical nor quantum. In the GPT known as "Boxworld" [19,26] (originally called "generalized no-signalling theory"), correlations can be stronger than in quantum theory, violating Bell inequalities by an amount in excess of the maximum quantum violation. The k = 3 system in Boxworld, known as the "generalized no-signalling bit", has received a great deal of attention. A pair of such systems can generate the stronger-than-quantum correlations known as a Popescu-Rohrlich box [52], from which the name Boxworld derives. These achieve a CHSH inequality violation equal to the algebraic maximum.
Such correlations are achievable in Boxworld because there are some states that respond deterministically to multiple effects, and there are also some effects that respond deterministically to multiple states. Boxworld also has a k = 4 system, which shares features of the generalized no-signalling bit and is, in certain respects, more straightforward to compare to a qubit. It is the latter that we depict in Fig. 1(c).
Another alternative to classical and quantum theories is the toy theory introduced by one of the authors [53]. We here consider a variant of this theory wherein one closes under convex combinations. The simplest system has k = 4 and has the state and effect spaces depicted in Fig. 1(d).

Finally, Fig. 1(e) illustrates a generic example of a GPT with k = 4. We constructed this GPT by generating a rank-4 matrix of random probabilities and finding GPT representations of the state and effect spaces from it.
In this paper, we describe a technique for estimating the GPT state and effect spaces that govern nature directly from experimental data. The examples described above illustrate the diversity of forms that the output of our technique could take.

C. Dual spaces
Finally, we review the notion of the dual spaces of GPT state and effect spaces. We will call a vector s ∈ R^k a logically possible state if it assigns a valid probability to every measurement effect allowed by the GPT. Mathematically, the space of logically possible states, denoted S_logical, contains all s ∈ R^k such that ∀e ∈ E : 0 ≤ s · e ≤ 1 and such that s · u = 1. From this definition, it is clear that S_logical is the intersection of the geometric dual of E and the hyperplane defined by s · u = 1; as a shorthand, we will refer to S_logical simply as "the dual of E", and denote the relation by S_logical ≡ dual(E). Analogously, the set of logically possible effects, denoted E_logical, contains all e ∈ R^k such that ∀s ∈ S : 0 ≤ s · e ≤ 1. Defining the set of subnormalized states by Ŝ ≡ {ws : s ∈ S, w ∈ [0, 1]}, E_logical is the geometric dual of Ŝ. For simplicity, we will refer to E_logical simply as "the dual of S", and denote the relation by E_logical ≡ dual(S).
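When the effect space E is a polytope (as for the GPT spaces inferred from finite data in this work), membership in S_logical = dual(E) reduces to finitely many linear constraints: 0 ≤ s · e ≤ 1 need only be checked at the extremal effects, since the constraints are linear and E is their convex hull. A sketch, using the classical-bit effect diamond as a stand-in example:

```python
# Sketch: test whether a candidate state vector is "logically possible",
# i.e. lies in dual(E), for a toy effect space spanned by finitely many
# extremal effects (here, the k = 2 classical-bit diamond).
import numpy as np

u = np.array([1.0, 0.0])                       # unit effect, gauge-fixed
# Extremal effects of the (rotated) classical bit: the zero effect, the
# unit effect, and the two sharp effects (1/2, 1/2) and (1/2, -1/2).
extremal_E = np.array([[0, 0], [1, 0], [0.5, 0.5], [0.5, -0.5]])

def logically_possible(s, atol=1e-9):
    # Checking the extremal effects suffices: the constraints 0 <= s.e <= 1
    # are linear in e, and E is the convex hull of these points.
    probs = extremal_E @ s
    return bool(np.isclose(s @ u, 1)
                and np.all(probs >= -atol)
                and np.all(probs <= 1 + atol))

assert logically_possible(np.array([1.0, 0.6]))        # inside the segment
assert not logically_possible(np.array([1.0, 1.4]))    # outside: s.e > 1
```

For this example, S_logical coincides with the GPT state space itself (the segment with −1 ≤ s_1 ≤ 1), reflecting the fact that the classical bit satisfies the no-restriction hypothesis.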
GPTs in which S logical = S and E logical = E (the two conditions are equivalent) are said to satisfy the no-restriction hypothesis [20]. In a theory that satisfies the no-restriction hypothesis, every logically allowed GPT effect vector corresponds to a physically allowed measurement, and (equivalently) every logically allowed GPT state vector corresponds to a physically allowed preparation. In theories wherein S logical ≠ S and E logical ≠ E, by contrast, there are vectors that do not correspond to physically allowed states but nonetheless assign valid probabilities to all physically allowed effects, and there are vectors that do not correspond to physically allowed effects but are nonetheless assigned valid probabilities by all physically allowed states.
For each of the examples in Fig. 1, we have depicted the dual to the effect space alongside the state space and the dual of the state space alongside the effect space, as wireframes. Quantum theory, classical probability theory, and Boxworld provide examples of GPTs that satisfy the no-restriction hypothesis, as illustrated in Fig. 1(a),(b),(c), while the GPTs presented in Fig. 1(d),(e) are examples of GPTs that violate it.

D. The GPT inference problem
The true GPT state and effect spaces, S and E, are theoretical abstractions, describing the full set of GPT state and effect vectors that could be realized in principle if one could eliminate all noise. However, the ideal of noiselessness is never achieved. Therefore, the GPT state and effect vectors describing the preparation and measurement effects realized in any experiment are necessarily bounded away from the extremal elements of S and E. Geometrically, the realized GPT state and effect spaces will be contracted relative to their true counterparts.
There is another way in which the experiment necessarily differs from the theoretical abstraction: a real experiment cannot probe all possible experimental configurations allowed by the GPT. For instance, in quantum theory there are an infinite number of convexly extremal preparations and measurements even for a single qubit, while a real experiment can only implement a finite number of each.
Because we assume convex closure, the realized GPT state and effect spaces will be polytopes. If the experiment probes a sufficiently dense sample of the preparations and measurements allowed by the GPT, then the shapes of these polytopes ought to resemble the shapes of their true counterparts.
We term the convex hull of the GPT states that are actually realized in an experiment the realized GPT state space, and denote it by S realized . Because every preparation is noisier than the ideal version thereof, this will necessarily be strictly contained within the true GPT state space S. Similarly, we term the diamond defined by the GPT measurement effects that are actually realized in an experiment the realized GPT effect space, and denote it E realized . Again, we expect it to be strictly contained within E. By dualization, S realized defines the set of GPT effect vectors that are logically consistent with the realized preparations, which we denote by E consistent , that is, E consistent ≡ dual(S realized ). Similarly, the set of GPT state vectors that are logically consistent with the realized measurement effects is S consistent ≡ dual(E realized ).
Suppose one has knowledge of the realized GPT state and effect spaces S realized and E realized for some experiment. What can one then infer about S and E? The answer is that S can be any convex set of GPT states that lies strictly between S realized and S consistent . For every such possibility for S, E could be any diamond of GPT effects that lies between E realized and dual(S) ⊂ E consistent . These inclusion relations are depicted in Fig. 2.
The larger the gap between S realized and S consistent , the more choices of S and E there are that are consistent with the experimental data. An example helps illustrate the point. Suppose that one found S realized and E realized to be the GPT state and effect spaces depicted in Fig. 1(d). In this case S realized is represented by the blue octahedron in Fig. 1(d)(i), and E realized is the green diamond with an octahedral base depicted in Fig. 1(d)(ii-iii). The wireframe cube in Fig. 1(d)(i) is the space of states S consistent that is the dual of E realized , and the wireframe diamond with a cubic base in Fig. 1(d)(ii-iii) is the space of effects E consistent that is the dual of S realized . Which GPTs are candidates for the true GPT in this case? The answer is: those whose state space contains the blue octahedron and is contained by the wireframe cube in Fig. 1(d)(i) and whose effect space contains the green diamond with the octahedral base in Fig. 1(d)(ii)-(iii) (the consistency of the effect space with the state space is a given if one grants that the pair is a valid GPT). By visual inspection of Fig. 1(a) and Fig. 1(c), it is clear that the GPTs representing both quantum theory and Boxworld are consistent with this data. The GPT for a classical 4-level system (i.e. the k = 4 generalization of the classical bit in Fig. 1(b) [28]) is as well.
When there is a large gap between S realized and S consistent , it is important to consider the possibility that this is due to a shortcoming in the experiment and that probing more experimental configurations will reduce it. For instance, if an experiment on a 2-level system was governed by quantum theory, but the experimenter only considered experimental configurations involving eigenstates of Pauli operators, then S realized and E realized would be precisely those of the example we have just described (depicted in Fig. 1(d)), implying many possibilities besides quantum theory for the true GPT. However, further experimentation would reveal that this seemingly large scope for deviations from quantum theory was merely an artifact of probing a too-sparse set of configurations. Only if one continually fails to close the gap between S realized and S consistent , in spite of probing the greatest possible variety of experimental configurations, should one consider the possibility that in fact S ≃ S realized and E ≃ E realized and that the true GPT fails to satisfy the no-restriction hypothesis. By contrast, if the gap between S realized and S consistent is very small, the experiment has found a tightly constrained range of possibilities for the true GPT, and it successfully rules out a large class of alternative theories.

FIG. 2. The GPT specifies a space of true states, S, and effects, E. From these, one can find the sets of logically possible states, S logical , and effects, E logical . E logical is the dual of S, and it represents all effects which return probabilities between 0 and 1 when applied to every possible state in S. Similarly, S logical is the dual of E. The logical state (effect) space must always contain the true state (effect) space. The spaces S realized and E realized are the GPT representations of the preparations and measurement effects actually realized in the experiment. As any real experiment necessarily contains a finite amount of noise, S realized will always be contained within S, and E realized will always be contained within E. E consistent is the dual of S realized (and thus will always contain E logical ), and it represents all effects that are logically consistent with the set of states realized in the experiment. Similarly, S consistent will always contain S logical as it is the dual of E realized .

III. SELF-CONSISTENT TOMOGRAPHY IN THE GPT FRAMEWORK
We have just seen that any real experiment defines a set of realized GPT states, S realized , and a set of realized GPT effects, E realized , and it is from these that one can infer the scope of possibilities for the true spaces, S and E, and thus the scope of possibilities for deviations from quantum theory.
But how can one estimate S realized and E realized from experimental data? In other words, how can one implement tomography within the GPT framework? This is the problem whose solution we now describe. The steps in our scheme are outlined in Fig. 3.

A. Tomographic completeness and the precision strategy for discovering dimensional deviations
In the introduction, we distinguished two ways in which the true GPT describing a given degree of freedom might deviate from quantum expectations. The first possibility for deviations was in the shapes of the state and effect spaces, assuming no deviation in the dimension of the GPT vector space in which these are embedded. The second possibility was more radical: a deviation in the dimension itself. In this section, we evaluate what sort of evidence one can obtain about the dimension of the GPT required to model a given degree of freedom.
We presume that there is a principle of individuation for different degrees of freedom, which is to say a way to distinguish what degree of freedom an experiment is probing. For instance, we presume that we can identify certain experimental operations as preparations and measurements of photon polarization and not of some other degree of freedom.
As noted earlier, the dimension of the GPT vector space associated to a degree of freedom is the minimum cardinality of a tomographically complete set of preparations (or measurements) for that degree of freedom. Therefore, for the dimension implied by our data analysis to be the true dimension, the sets of preparations and measurements that are experimentally realized must be tomographically complete for that degree of freedom.
Because one cannot presume the correctness of quantum theory, however, one does not have any theoretical grounds for deciding which sets of measurements (preparations) are tomographically complete for a given system. Indeed, whatever set of preparations (measurements) one considers as a candidate for a tomographically complete set, one can never rule out the possibility that tomorrow a novel variety of preparations (measurements) will be identified whose statistics are not predicted by those in the putative tomographically complete set, thereby demonstrating that the set was not tomographically complete after all. As such, any supposition of tomographic completeness is always tentative.
As Popper emphasized, however, all scientific claims are vulnerable to being falsified and therefore have a tentative status [50]. We are therefore recommending to treat the hypothesis that a given set of measurements and a given set of preparations are tomographically complete as Popper recommends treating any scientific hypothesis: one should try one's best to falsify it and as long as one fails to do so, the hypothesis stands.
As noted in the introduction, it is useful to distinguish between two types of opportunities for falsifying a hypothesis about what sets of preparations and measurements are tomographically complete: terra nova strategies and precision strategies. In this article, we pursue the latter approach. To explain how a precision strategy provides an opportunity for detecting deviations from the quantum prediction for the dimension of the GPT vector space, we offer an illustrative analogy.

FIG. 3. Overview of the self-consistent GPT tomography procedure. We begin with the experimental data, finite-run relative frequencies for each configuration realized in the experiment, and arrange it into a matrix, F , which is a noisy version of the matrix of true probabilities, D realized . To estimate the dimension, k, of the data, we find the rank-k matrix which best fits F for a set of values of k. We call this set of best-fit rank-k matrices the candidate model set. A statistical analysis on the candidate model set (using the χ 2 goodness-of-fit test and the Akaike information criterion) determines the value of k that gives us the best fit, and therefore which of the candidate models is the best approximation to D realized . We denote this best approximation by D̃ realized . We find a decomposition D̃ realized = S̃ realized Ẽ realized , in order to estimate the spaces of states and effects realized in the experiment. Each row of S̃ realized is a GPT state vector representing one of the preparation procedures in the experiment, and each column of Ẽ realized is a GPT effect vector representing one of the measurement procedures. This completes the GPT tomography procedure.
Suppose that the GPT describing the world is indeed quantum theory. Now consider an experiment on photon polarization wherein the experimentally realized preparations and measurements are restricted to a real-amplitude subalgebra of the full qubit algebra, that is, a rebit subalgebra. In this case, the realized GPT state and effect spaces correspond, respectively, to a restriction of the Bloch ball in Fig. 1(a)(i) to an equatorial disc and to a restriction of the ball-based diamond in Fig. 1(a)(ii)-(iii) to the diamond with the disc as its base (which is the 3-dimensional projection, depicted in Fig. 1(a)(ii), of the full 4-dimensional qubit effect space).
Suppose an experimenter did not know the ground truth about the GPT describing photon polarization, which by assumption in our example is the GPT associated to the full qubit algebra. If they mistakenly presumed that the preparations and measurements realized in the rebit experiment were tomographically complete, they would be led to a false conclusion about the GPT describing photon polarization. Nonetheless, and this is the point we wish to emphasize, high-precision experimental data provides them with an opportunity for recognizing their mistake.
The key observation is that the only case in which the experimental data contains strictly no evidence of states and effects beyond the restricted subalgebras is if the realized preparations and measurements obey the restriction exactly. However, any real implementation of experimental procedures is necessarily imperfect, and certain types of imperfections (e.g., systematic errors) will result in preparations and measurements that do extend into the higher-dimensional space-in our example, from the rebit spaces into the full qubit spaces, hence from dimension 3 to dimension 4. For instance, they might lead to preparations that were not strictly restricted to an equatorial disc but rather a fattened pancake-shaped subset of the Bloch ball, and similarly for the measurements. The realized preparations and measurements in this case would still be very far from sampling the full qubit state and effect spaces, but they would nonetheless attest to the need for a GPT vector space of dimension 4 rather than one of dimension 3. Of course, if the deviation is small, then one requires a correspondingly small degree of statistical error in the characterization of the state and effect spaces in order to detect it. Hence the need for precision in the characterization of the states and effects.
If, in our imagined example, an experimentalist detected a deviation from their expectations regarding dimensionality in this fashion, they would be prompted to look for new preparations and measurements that might extend further into this 4th dimension. We can easily imagine that, via such a precision-based discovery of an anomaly, the experimentalist could come to learn that what at first appeared to be a rebit was in fact a qubit.
We can now draw the analogy between this sort of example and the experiment we analyze here. Despite the fact that we did not intentionally seek to do anything exotic in our preparations and measurements of photon polarization, it could nonetheless be the case that the GPT vectors representing these had small components in additional dimensions of GPT vector space, beyond the 4 dimensions that quantum theory stipulates as sufficient for modelling photon polarization. In this case, our scheme would find that the data is only fit well by a GPT of dimension greater than 4. To the extent that one was confident that the experimental procedures did not inadvertently probe some additional degrees of freedom beyond photon polarization, this would constitute evidence for postquantum physics.
We turn now to describing the self-consistent GPT tomography procedure.
B. Inferring best-fit probabilities from finite-run statistics

We suppose that, for a given system, the experimenter makes use of a finite number, m, of preparation procedures (P i , i ∈ {1, · · · , m}) and a finite number, n, of binary-outcome measurement procedures (M j , j ∈ {1, · · · , n}). We denote the outcome of each measurement by a ∈ {0, 1}. For each choice of preparation and measurement, (P i , M j ), the experimenter records the outcome of the measurement in a large number of runs and computes the relative frequency with which a given outcome a occurs, denoted f (a|P i , M j ). For the binary-outcome measurements under consideration, it is sufficient to specify f (0|P i , M j ). The set of all experimental data, therefore, can be encoded in an m × n matrix F , whose ij-th entry is F ij = f (0|P i , M j ). The relative frequency f (0|P i , M j ) one measures will not coincide exactly with the probability p(0|P i , M j ) from which it is assumed that the outcome in each run is sampled. 7 Rather, f (0|P i , M j ) is merely a noisy approximation to p(0|P i , M j ). The statistical variation in f (0|P i , M j ) can, however, be estimated from the experiment.
It follows that the matrix F extracted from the experimental data is merely a noisy approximation to the matrix D realized that encodes the predictions of the GPT for the mn experimental configurations of interest. Because of the noise, F will generically be full rank, regardless of the rank of D realized [54]. Therefore, the experimentalist is tasked with estimating the m × n probability matrix D realized given the m × n data matrix F , where the rank of D realized is a parameter in the fit.
We aim to describe our technique in a general manner, so that it can be applied to any experiment. However, in order to provide a concrete example of its use, we will intersperse our presentation of the technique with details about how it is applied to the particular experiment we have conducted. We begin, therefore, by providing the details of the latter.

7 Note that it is presumed that the outcome variables for the different runs (on a given choice of preparation and measurement) are identically and independently distributed. This assumption could fail, for instance, due to a drift in the nature of the preparation or measurement over the timescale on which the different runs take place, or due to a memory effect that makes the outcomes in different runs correlated. In such cases, one would require a more sophisticated analysis than the one described here.

FIG. 4. Pairs of polarization-separable single photons are created via spontaneous parametric down-conversion. The herald photon is sent to a detector. The signal photon travels through a polarizer, then a quarter- and half-waveplate, which prepare its polarization state. The photon is then coupled into single-mode fibre, which removes any information which may be encoded in the photon's spatial degree of freedom. Three static waveplates undo the polarization rotation caused by the fibre. Two waveplates and a polarizing beamsplitter with detectors in each output port perform a measurement on the photon. One output port is labelled '0', and the other is labelled '1'. Coincident detections between the herald detector, D h , and the detector in the transmitted port, D t , are counted, as well as coincidences between D h and the reflected-port detector D r . PPKTP: periodically-poled potassium titanyl phosphate; PBS: polarizing beamsplitter; GT-PBS: Glan-Thompson polarizing beamsplitter; IF: interference filter; HWP: half waveplate; QWP: quarter waveplate.

C. Description of the experiment
To illustrate the GPT tomography scheme, we perform an experiment on the polarization degree of freedom of single photons. Pairs of photons are created via spontaneous parametric down-conversion, and the detection of one of these photons, called the herald, indicates the successful preparation of the other, called the signal. We manipulate the polarization of the signal photons with a quarter-and half-waveplate before they are coupled into a single-mode fibre; each preparation is labelled by the angles of these two waveplates.
Upon emerging from the fibre, the signal photons encounter the measurement stage of the experiment, which consists of a quarter-and half-waveplate followed by a polarizing beam splitter with single-photon detectors at each of its output ports. Each measurement is labelled by the angles of the two waveplates preceding the beam splitter.
The frequency of the 0 outcome is defined as the ratio of the number of heralded signal photon detections in the 0 output port to the total number of heralded detections. We ignore experimental trials in which either the herald or the signal photon is lost by post-selecting on coincident detections, so that our measurements are only performed on normalized states. This is akin to making a fair-sampling assumption, that is, we assume that the statistics of the detected photons are representative of the statistics we would have measured if our experiment had perfect efficiency. Post-selecting on coincident detections has the additional benefit of allowing us to filter out background counts that are caused by, for example, stray room light or "dark" counts from our detectors.
We choose m = 100 waveplate settings for the preparations, and n = 100 waveplate settings for the measurements, corresponding to mn = 10 4 experimental configurations in all, one for each pairing.
We choose m = n so that the GPT state space and the GPT effect space are equally well characterized. We detect coincidences at a rate of ∼ 2250 counts/second, and count coincidences for each preparation-measurement pair for a total of eight seconds, allowing us to achieve a standard deviation on each data point below the 1% level. Because of the additional time it takes to mechanically rotate the preparation and measurement waveplates, it takes approximately 84 hours to acquire data for 10 4 preparation-measurement pairs.
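As a rough sanity check of the quoted precision (a back-of-the-envelope sketch of our own, not part of the analysis pipeline): at ~2250 coincidences/second for eight seconds per configuration, the binomial standard deviation of a relative frequency is largest at f = 0.5 and indeed falls below the 1% level.

```python
import numpy as np

N = 2250 * 8                        # approximate coincidences per configuration
f = 0.5                             # worst case for the spread of a frequency
sigma = np.sqrt(f * (1 - f) / N)    # binomial standard deviation of f
print(sigma)                        # about 0.0037, i.e. below 1%
```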
Our method of selecting which 100 waveplate settings to use is described in Appendix B. Note that although the choice of these settings is motivated by our knowledge of the quantum formalism, our tomographic scheme does not assume the correctness of quantum theory: our reconstruction scheme could have been applied equally well if the waveplate settings had been chosen at random. 8

The raw frequencies are arranged into the data matrix F . Entry F ij is the frequency at which the 0 outcome was obtained when measurement M j was performed on a photon that was subjected to preparation P i . As noted in Sec. II A, we adopt a convention wherein M 1 is the unit measurement, implying that the first column of F is a column of 1s. The data matrix for our experiment is presented in Fig. 5. As expected, we find that F is full rank.

D. Estimating the probability matrix D realized
We turn now to the problem of estimating from F the m × n probability matrix D realized . The first item of business is to estimate the rank of D realized , which is equivalent to estimating the cardinality of the tomographically complete set of preparations (or measurements) of the GPT model of the experiment.

8 An interesting question for future research is how the quality of the GPT reconstruction varies with the particular set of waveplate settings that are considered. In particular, one can ask about the quality of the evidence for quantum theory in the situation wherein the waveplate settings correspond to sampling highly nonuniformly over the points on the Bloch sphere.

For a given hypothesis k about the value of the rank, and for a given data matrix F , we find the rank-k matrix D̃ realized that is the maximum-likelihood estimate of the rank-k probability matrix D realized that generated F . In other words, D̃ realized is the rank-k matrix that minimizes the weighted χ 2 statistic, defined as

χ 2 ≡ Σ i,j [F ij − (D̃ realized ) ij ] 2 / (∆F ij ) 2 ,

where (∆F ij ) 2 is the variance of F ij . This minimization problem is known as the weighted low-rank approximation problem, a non-convex optimization problem with no analytical solution [55, 56]. Nonetheless, one can use a fitting algorithm based on an alternating-least-squares method [56].
In the algorithm, it is important to constrain the entries ofD realized to lie within the interval [0, 1] so that they may be interpreted as probabilities. Full details are provided in Appendix C.
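To make the procedure concrete, here is a minimal sketch of alternating least squares for weighted low-rank approximation (Python/NumPy; variable names are ours). It omits the [0, 1] constraint on the entries, which the full algorithm of Appendix C enforces.

```python
import numpy as np

def weighted_low_rank(F, W, k, iters=50):
    """Approximately minimize sum_ij W_ij * (F_ij - (S @ E)_ij)^2 over
    rank-k factorizations S (m x k) and E (k x n), by alternating
    weighted least-squares updates. Initialized from the truncated SVD."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    E = Vt[:k].copy()
    S = np.zeros((F.shape[0], k))
    for _ in range(iters):
        for i in range(F.shape[0]):      # update each row of S (a GPT state)
            w = np.sqrt(W[i])
            S[i] = np.linalg.lstsq(E.T * w[:, None], F[i] * w, rcond=None)[0]
        for j in range(F.shape[1]):      # update each column of E (a GPT effect)
            w = np.sqrt(W[:, j])
            E[:, j] = np.linalg.lstsq(S * w[:, None], F[:, j] * w, rcond=None)[0]
    return S, E
```

Each inner update is a convex weighted least-squares problem, so the objective is non-increasing; the overall problem is nonetheless non-convex, and in practice one restarts from several initializations to avoid poor local minima.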
To estimate the rank of the true model underlying the data, one must compare different candidate model ranks.
(For our experiment, we consider k ∈ {2, 3, . . . , 10}.) For each candidate rank k, one first computes the χ 2 of the maximum-likelihood model of that rank, denoted χ 2 k , in order to determine the extent to which each model might underfit the data. Second, one computes for the maximum-likelihood model of each rank the Akaike information criterion (AIC) score [57, 58] in order to determine the relative extent to which the various models either underfit or overfit the data.
We begin by describing the method by which one finds the rank-k probability matrix D̃ realized which minimizes χ 2 . Note that an m × n matrix with rank k is specified by a set of r k = k(m + n − k) real parameters [59]; thus, if the true probability matrix D realized is rank k, then we expect that χ 2 k will be sampled from a χ 2 distribution with mn − k(m + n − k) = (m − k)(n − k) degrees of freedom [60].
For our experiment, we calculate the variances (∆F ij ) 2 in the expression for χ 2 by assuming that the number of detected coincident photons follows a Poissonian distribution. Fig. 6(a) displays the interval containing 99% of the probability density for a χ 2 distribution with (m − k)(n − k) degrees of freedom, as well as χ 2 k , for each value of k ∈ {2, 3, . . . , 10}. For k < 4, χ 2 k lies far outside the expected 99% range, and we rule out these models with high confidence.
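For concreteness, the expected range of χ 2 k under each rank-k hypothesis can be computed as follows (a sketch using scipy.stats, with m = n = 100 as in the experiment; the printed intervals are what Fig. 6(a) plots, not the experimental χ 2 k values themselves):

```python
from scipy.stats import chi2

m, n = 100, 100
for k in range(2, 11):
    dof = (m - k) * (n - k)            # degrees of freedom of the rank-k fit
    lo, hi = chi2.interval(0.99, dof)  # central 99% interval for chi^2 with dof d.o.f.
    print(k, dof, round(lo, 1), round(hi, 1))
```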
The Akaike information criterion assigns a score to each model in a candidate set, termed its AIC score. The Kullback-Leibler (KL) divergence is a measure of the information lost when one probability distribution is used to represent another [61], and the AIC score of a candidate model is a measure of the KL divergence between that model and the true model underlying the data. Since the true model is not known, the KL divergence cannot be calculated exactly; rather, each candidate model's AIC score represents its KL divergence from the true model relative to the other models in the candidate set. The candidate model with the lowest AIC score is closest to the true model (in the KL sense), and thus it is the most likely representation of the data among the set of candidates.
The AIC scores can be used to determine which model among a set of candidate models is the most likely to describe the data. If AIC k denotes the AIC score of the kth model, and ∆ k denotes the difference between this score and the minimum score among all candidate models, ∆ k := AIC k − min k ′ AIC k ′ , then its AIC weight is defined as w k := e −∆ k /2 / Σ 10 k ′ =2 e −∆ k ′ /2 [61]. The AIC weight w k represents the likelihood that the kth model is the model that best describes the data, relative to the other models in the set of candidate models. In our experiment, the candidate models differ by rank, and the AIC score of a rank-k candidate model is defined as AIC k = χ 2 k + 2r k [61]. The first term rewards models in proportion to how well they fit the data, and the second term penalizes models in proportion to their complexity, as measured by the number of parameters. For our experiment, the set of candidate models is the set of best-fit rank-k models for k ∈ {2, . . . , 10}. We plot the AIC values for each candidate model in Fig. 6(b). AIC k is minimized for k = 4, and we conclude that the true model underlying our dataset is most likely rank 4. The relative likelihood of each candidate model is shown in Fig. 6(c). We find w 4 = 0.9998, w 5 = 1.99 × 10 −4 , and w k < 10 −12 for other values of k.
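The computation of AIC scores and Akaike weights can be sketched as follows (Python; the χ 2 values below are placeholders of our own, not the experimental ones):

```python
import numpy as np

def aic_weights(chi2_vals, ks, m, n):
    """AIC_k = chi2_k + 2 r_k, with r_k = k(m + n - k) free parameters;
    the Akaike weight w_k is the relative likelihood of model k."""
    aic = np.array([c + 2 * k * (m + n - k) for c, k in zip(chi2_vals, ks)])
    delta = aic - aic.min()          # offset from the best-scoring model
    w = np.exp(-delta / 2)
    return w / w.sum()

# Hypothetical chi^2 values for ranks 3, 4, 5 with m = n = 100:
w = aic_weights([20000.0, 9200.0, 9150.0], [3, 4, 5], 100, 100)
print(w.round(4))   # the rank-4 model carries essentially all the weight
```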
The χ 2 goodness-of-fit test indicates that the max-likelihood rank-4 model fits the data well, and the AIC test indicates that this same model is the most likely of all nine candidate models to have generated the data, with relative probability 0.9998. We conclude with high confidence that the GPT that best describes our experiment has dimension 4. Recall that it is still possible that the true GPT describing photon polarization has dimension greater than 4 because it is possible that the sets of preparations and measurements we have implemented in our experiment are not tomographically complete for photon polarization.
Nonetheless, the focus of much of the rest of this section and the focus of all of section IV will be to describe what additional conclusions can be drawn from our experimental data if we adopt the hypothesis that the preparations and measurements we realized are, in fact, tomographically complete for photon polarization, with the understanding that this hypothesis could in principle be overturned by future experiments that achieved higher precision or realized an exotic new variety of preparations and measurements for photon polarization. These additional conclusions concern the possibility of deviations from quantum theory in the shape of the state and effect spaces, rather than in the dimension of the vector space in which these are embedded.

E. Estimating the realized GPT state and effect spaces
The realized GPT state space, S realized , and the realized GPT effect space, E realized , define the probability matrix D realized from which the measurement outcomes in the experiment are sampled.
As noted above, the matrix D̃ realized for the rank-4 fit provides our best estimate of the true probability matrix D realized . To obtain an estimate of the realized GPT state and effect spaces from D̃ realized , we must decompose it in the manner described in Sec. II A, that is, as D̃ realized = S̃ realized Ẽ realized .
Recall that this decomposition is not unique. A convenient choice is a modified form of the singular-value decomposition, where one constrains the first column of S̃ realized to be a column of ones, and one constrains the other columns of S̃ realized to be orthogonal to the first (a detailed description of this decomposition is given in Appendix D).
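A minimal sketch of such a decomposition (our own variant, which enforces the all-ones first column but not the orthogonality constraint of Appendix D): factor the matrix via a truncated SVD, then apply an invertible "gauge" transformation T, exploiting the fact that the first column of the probability matrix is the all-ones column corresponding to the unit measurement.

```python
import numpy as np

def gpt_decompose(D, k):
    """Factor a rank-k matrix D = S @ E such that the first column of S
    is all ones. Assumes D's first column is the all-ones column (the
    unit measurement). The orthogonality constraint on the remaining
    columns of S, used in Appendix D, is not imposed here."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    S0 = U[:, :k] * s[:k]
    E0 = Vt[:k]
    # Since D[:, 0] = S0 @ E0[:, 0] = 1, choosing t1 = E0[:, 0] as the
    # first column of the gauge T gives (S0 @ T)[:, 0] = 1.
    t1 = E0[:, 0]
    j = int(np.argmax(np.abs(t1)))   # pivot entry, guarantees T is invertible
    T = np.eye(k)
    T[:, j] = t1
    T = T[:, [j] + [c for c in range(k) if c != j]]
    return S0 @ T, np.linalg.solve(T, E0)
```

Any invertible completion of T yields an equally valid factorization; this freedom is exactly the non-uniqueness of the decomposition noted above.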
If quantum theory is the correct theory of nature, then the experimental data should be consistent with the GPT state space being the Bloch Ball and the GPT effect space being the Bloch Diamond (depicted in Fig. 1(a)), up to a linear invertible transformation.
Our estimate of the realized GPT state space, S̃ realized , is simply the convex hull of the rows of the matrix S̃ realized . In the case of the effects, we can again take convex mixtures, but because one also has the freedom to post-process measurement outcomes, our estimate of the realized GPT effect space is slightly more complicated.
There are two classes of convexly extremal classical post-processings that can be performed on a binary-outcome measurement. We call the first class of convexly extremal post-processings the outcome-swapping class. In such a post-processing, the outcome returned by a measurement device is deterministically swapped to the other outcome. Under outcome-swapping, the outcome-0 effect for a specific measurement procedure, e, is mapped to its complement, u − e.

We call the second class of convexly extremal post-processings the outcome-fixing class. In such a post-processing, the outcome returned by a measurement device is ignored, and deterministically replaced by a fixed outcome, 0 or 1. For the case where the outcome is replaced by 0, the image of this post-processing is the unit effect u, and for the case where it is replaced by 1, the image is the complement of the unit effect (represented by the zero vector).
The full set of post-processings is obtained by taking all convex mixtures of these extremal ones. Hence Ẽ realized is the closure under convex mixtures and classical post-processing of the vectors defined by the columns of the matrix Ẽ realized . As we have already included the unit measurement effect in D̃ realized , it is represented in Ẽ realized as well. Therefore, Ẽ realized is the convex hull of the union of the set of column vectors in the matrix Ẽ realized and the set of their complements.

Our estimate of the realized GPT state space, S̃ realized , and our estimate of the realized GPT effect space, Ẽ realized , are displayed in Fig. 7(a)-(c). Omitting the first column of S̃ realized (because it contains no information), we visualize the realized GPT state space by plotting the convex hull of the vectors defined by the last three entries of each row of S̃ realized in a 3-dimensional space (the solid light blue polytope in Fig. 7(a)). As all four entries of each column of Ẽ realized contain information, the convex hull of the vectors defined by these is 4-dimensional. To visualize the realized GPT effect space, therefore, we plot two 3-dimensional projections of it, namely, the projections e → (e 0 , e 1 , e 2 ) and e → (e 1 , e 2 , e 3 ) (the solid light green polytopes in Figs. 7(b) and 7(c), respectively). 9 Qualitatively, S̃ realized is a ball-shaped polytope, and Ẽ realized is a four-dimensional diamond with a ball-shaped polytope as its base. They are qualitatively what one would expect if quantum theory is the correct description of nature.
Next, we compute the duals of these spaces. How this is done is described in detail in Appendix E. Our estimate of the set of GPT state vectors that are consistent with the realized GPT effects,S consistent = dual(Ẽ realized ), is plotted alongsideS realized in Fig. 7(a) as a wireframe polytope. Similarly, our estimate of the set of GPT effect vectors consistent with the realized GPT states, E consistent = dual(S realized ), is plotted as a wireframe alongsideẼ realized in Fig. 7(b),(c).
The smallness of the gap betweenS realized and S consistent implies that the possibilities for the true GPT are quite limited. Obviously, our results easily exclude all of the nonquantum examples of GPTs presented in Fig. 1.
Our results can be used to infer limits on the extent to which the true GPT might fail to satisfy the norestriction hypothesis. One way of doing so is by bounding the volume ratio of S to S logical . From the discussion in Sec. II D, it is clear that this is upper bounded by the volume ratio of S realized to S consistent . Given our estimates of the latter two spaces, we can compute an estimate of this ratio. We find it to be 0.9229 ± 0.0001.
The error bar is the standard deviation in the volume ratio from 100 Monte Carlo simulations. We begin each simulation by simulating a set of coincidence counts. Each set of counts is found by sampling each count from a Poisson distribution with mean and variance equal to the number of photons counted in the true experiment 10 . To our knowledge, this is the first quantitative limit on the extent to which the GPT governing nature might violate the no-restriction hypothesis.

F. Increasing the number of experimental configurations
Because the vertices of the polytopes describing S realized in Figs. 7(a)-(c) are determined by the finite set of preparations and measurement effects that were implemented, the observed deviation from sphericity is 9 This is the same pair of projections used to visualize the 4dimensional GPT effect spaces depicted in Fig. 1. 10 Since our analysis procedure includes a constrained optimization, it is difficult to apply standard error analysis techniques to determine how errors in the measured outcome frequencies affect the GPT state and effect vectors returned by the optimization step. This is why we use a Monte Carlo analysis to estimate the errors on our estimates of the realized GPT states and effects. We note that more sophisticated error-analysis methods might give us a better estimate of the true size of the errors in our experiment, however, development of such techniques is outside the scope of this work.
obviously an artifact of an insufficiently dense set of experimental configurations, and not evidence for any lack of smoothness of the true GPT state and effect spaces. A higher density of experimental configurations probed in bothS realized andS consistent would imply a more constrained set of possibilities for S and S logical . For instance, with a denser set of experimental configurations, the volume ratio ofS realized toS consistent would provide a tighter upper bound on the volume ratio of S to S logical . 11 As such, having a much denser set of experimental configurations would allow one to put a stronger bound on possible deviations from quantum theory, and in particular on possible deviatons from the no-restriction hypothesis. There is therefore a strong motivation to increase the number m of different preparations and the number n of different measurement effects that are probed in the experiment. It might seem at first glance that doing so is infeasible, on the grounds that it implies a significant increase in the number, mn, of preparation-measurement pairs that need to be implemented and thus an overwhelmingly long data-acquisition time.
However, this is not the case; one can probe more preparations and measurements by not implementing every measurement on every preparation. The key insight is that in order to characterize the GPT state vector associated to a given preparation, one needn't find its statistics on every measurement effect in the set being considered: it suffices to find its statistics on a subset thereof, namely, any tomographically complete subset of measurement effects. Similarly, in order to characterize the GPT effect vector associated to a given measurement effect, one need not implement it on the full set of preparations being considered, but just a tomographically complete subset thereof. The first experiment provided evidence for the conclusion that the tomographically complete sets have cardinality 4. It follows that one should be able to characterize m preparations and n measurements with just 4(m + n − 4) experimental configurations, rather than mn.
Despite the good evidence about the cardinality from the first experiment, we deemed it worthwhile to perform the second experiment in such a manner that the analysis of the data did not rely on any evidence drawn from the first experiment. Furthermore, we were motivated to have the second experiment provide an independent test of the hypothesis that the cardinality of the tomographically complete sets is indeed four. Given that the closest competitors to the rank-4 model on either side were those of ranks 3 and 5, we decided to restrict our set of candidate models to those having ranks in the set k ∈ {3, 4, 5}. In order for the experimental data to be able to reject the hypothesis of rank k as a bad fit, it is necessary that one have at least k + 1 measurements implemented on each preparation, and at least k+1 preparations on which each measurement is implemented; otherwise, one can trivially find a perfect fit. To be able to assess the quality of fit for a rank-5 model, therefore, we needed to choose at least 6 measurements that are jointly tomographically complete to implement on each of the m preparations and at least 6 preparations that are jointly tomographically complete on which each of the n measurements is implemented. We chose to use precisely 6 in each case, yielding a total of 6(m + n − 6) experimental configurations. Without exceeding the bound of ∼ 10 4 experimental configurations being probed (implied by the data acquisition time), we were able to take m = n = 1000 and thereby probe a factor of 10 more preparations and measurements than in the first experiment.
We refer to the set of six measurement effects (preparations) in this second experiment as the fiducial set. Our choice of which six waveplate settings to use in each of the fiducial sets is described in Appendix B. Our choice of which 1000 waveplate settings to pair with these is also described there. Our choices are based on our expectation that the true GPT is close to quantum theory and the desire to densely sample the set of all preparations and measurements. (Note that although our knowledge of the quantum formalism informed our choices, our analysis of the experimental data does not presume the correctness of quantum theory.) In the end, we also implemented each of our six fiducial measurement effects on each of our six fiducial preparations, so that we had m = n = 1006.
We also add the unit measurement effect to our set of effects. We thereby arrange our data into a 1006×1007 frequency matrix F , with the big difference to the first experiment being that F now has a 1000×1000 submatrix of unfilled entries.
We perform an identical analysis procedure to the one described in Sec. III D: for each k in the candidate set of ranks, we seek to find the rank-k matrixD realized of bestfit to F . For the entries in the 1000×1000 submatrix of D realized corresponding to the unfilled entries in F , the only constraint in the fit is that each entry be in the range [0, 1], so that it corresponds to a probability. The results of this analysis are presented in Fig. 6(d)-(f).
The χ 2 goodness-of-fit test (Fig. 6(d)) rules out the rank-3 model, and therefore all models with rank less than 3 as well. Calculating the AIC scores for the maximum-likelihood rank 3, 4 and 5 models shows that the rank-4 model is the one among these that is most likely to describe the data (Fig. 6(e),(f)). Indeed the relative probability of the rank 5 model is on the order of 10 −414 .
The reason that the likelihood of the rank 5 model is so low is because the number of parameters required to specify a rank-k m × n matrix is r k = k(m + n − k), and since m = n ∼ 1000, the rank-5 model requires ∼ 2000 more parameters than the rank-4 model. The number of model parameters is multiplied by a factor of two in the formula for the AIC score, and the difference between χ 2 5 and χ 2 4 is only ∼ 2000. This means that if the AIC score is used to calculate the likelihood of each model, the rank 5 model is ∼ e −2000/2 ∼ 10 −414 as likely as the rank 4 model.
The AIC formula we use was derived in the limit where the number of data points is much greater than the number of parameters in the model. In our second experiment the number of data points is roughly equal to the number of parameters in each model, and thus any conclusions which derive from use of the AIC formula must be taken with a grain of salt. We should instead use a corrected form of the AIC, called AIC C [61]. However, the formula for AIC C depends on the specific model being used, and to the best of our knowledge a formula has not been found for the weighted low rank approximation problem. However, every AIC C formula that we found for different types of models increased the amount by which models were penalized for complexity [61]. Hence we hypothesize that the proper AIC C formula would lead to an even smaller relative likelihood for the rank 5 model, and thus that we have strong evidence that a rank 4 model should be used to represent the second experiment. Finding the correct AIC C formula for the weighted low rank approximation problem is an interesting problem for future consideration.
Modulo this caveat, the second experiment corroborates the conclusion of the first experiment, namely, that our best estimate of the dimension of the GPT governing single-photon polarization is 4. 12 We decompose the rank-4 matrix of best fit and plot our estimates of the realized state space,S realized , and the realized effect space,Ẽ realized , in Fig. 7(d)-(f). The realized GPT state and effect spaces reconstructed from the second experiment are smoother than those from the first, and the gap betweenS realized andS consistent is smaller as well.
The volume ratio ofS realized toS consistent is found to be 0.977 ± 0.001, where the error bar is calculated from 12 As noted in our discussion of the first experiment, however, such a conclusion can in principle be overturned by future experiments if the preparations and measurements that are conventional for photon polarization exclude some exotic variety (or have undetectably small components of this exotic variety) and therefore fail to be tomographically complete.
100 Monte Carlo simulations. Compared to the first experiment, this provides a tighter bound on any failure of the no-restriction hypothesis.

A. Consistency with quantum theory
We now check to see if the possibilities for the true GPT state and effect spaces implied by our experiment include the quantum state and effect spaces.
As noted in Sec. II D, because it is in practice impossible to eliminate all noise in the experimental procedures, we expect that under the assumption that all of our realized preparations are indeed represented by quantum states, they will all be slightly impure (that is, their eigenvalues will be bounded away from 0 and 1). Their GPT state vectors should therefore be strictly in the interior of the Bloch Sphere. Similarly, we expect such noise on all of the realized measurement effects (with the exception of the unit effect and its outcome-swapped counterpart, which are theoretical abstractions), implying that their GPT effect vectors will be strictly in the interior of the 4-dimensional Bloch Diamond. This, in turn, implies that the extremal GPT state vectors in S consistent will be strictly in the exterior of the Bloch Sphere. The size of the gap between S realized and S consistent , therefore, will be determined by the amount of noise in the preparations and measurements.
Naïvely, one might expect that for the quantum state and effect spaces for a qubit to be consistent with our experimental results, S qubit must fit geometrically between our estimates of S realized and S consistent , up to a linear transformation. That is, one might expect the condition to be that there exists a linear transformation of S qubit that fits geometrically betweenS realized andS consistent .
However, noise in the experiment also leads to statistical discrepancies between the vertices ofS realized and those of S realized , and between the vertices ofẼ realized and those of E realized . This noise could lead to estimates of the realized GPT state and effect vectors being longer than the actual realized GPT state and effect vectors. If the estimates of any of these lie outside the qubit state and effect spaces, then one could find that it is impossible to find a linear transformation of S qubit that fits betweeñ S realized andS consistent , even if quantum theory is correct! We test the above intuition by simulating the first experiment under the assumption that quantum theory is the correct theory of nature. We assume that the states we actually prepare in the lab are slightly depolarized versions of the set of 100 pure quantum states that we are targeting, and that the measurements we actually perform are slightly depolarized versions of the set of 100 projective measurements we are targeting. We estimate the amount of depolarization noise from the raw data, and use the estimated amount of noise to calculate the outcome probabilities for each depolarized measurement on each depolarized state. We arrange these probabilities into a 100 × 100 table and use them to simulate 1000 sets of photon counts, then analyse each of the 1000 simulated datasets with the GPT tomography procedure.
We find that, for every set of simulated data, we are unable to find a linear transformation of S qubit that fits between the simulatedS realized andS consistent , confirming the intuition articulated above.
Nonetheless, we can quantify the closeness of the fit as follows. We find that if, for each simulation, we artificially reduce the length of the GPT vectors in the simu-latedS realized andẼ realized by multiplying them by a factor slightly less than one, then we can fit a linearly transformed S qubit between the smallerS realized and larger S consistent . On average, we find we have to shrink the vectors making upS realized andẼ realized by 0.11% ± 0.02%, where the error bar is the standard deviation over the set of simulations. To perform the above simulations we used CVX, a software package for solving convex problems [62,63].
We quantify the real data's agreement with the simulations by performing the same calculation as on the simulated datasets. We first notice that there is no linear transformation of S qubit that fits betweenS realized andS consistent , as in the simulations. Furthermore, we find that we can achieve a fit if we shrink the vectors making upS realized andẼ realized by 0.14%, which is consistent with the simulations. Thus the spacesS realized andẼ realized reconstructed from the first experiment are consistent with what we expect to find given the correctness of quantum theory.
When analysing data from the second experiment it takes ∼ 4 hours to run the code that solves the weighted low rank approximation problem. It is therefore impractical to perform 1000 simulations of this experiment. Instead, we extrapolate from the simulation of the first experiment.
We note two significant ways in which the second experiment differs from the first. First, we perform approximately 10 times as many preparation and measurement procedures in the second experiment than in the first, yet accumulate roughly the same amount of data. Hence, each GPT state and effect vector in the second experiment is characterized with approximately 10 times fewer detected photons than in the first experiment, and so we expect the uncertainties on the second experiment's reconstructed GPT vectors to be ∼ √ 10 times larger than the same uncertainties in the first experiment. We expect this √ 10 increase in uncertainty to translate to a √ 10 increase in the amount we need to shrinkS realized and E realized before we can fit a linearly transformed S qubit betweenS realized andS consistent . Second,S realized and E realized each contain 1006 GPT vectors, a factor of 10 more than in the first experiment. Since there are a greater number of GPT vectors in the second experiment it is likely that the outliers (i.e., the cases for which our estimate differs most from the true vectors) in the second experiment will be more extreme than those in the first experiment. This should also lead to an increase in the amount we need to shrink the vectors inS realized andẼ realized before we can fit a linearly transformed S qubit betweenS realized andS consistent .
We find that, for the data from the second experiment, we need to shrinkS realized andẼ realized by 0.65%, a factor only 4 times greater than the 0.14% of the first experiment, which seems reasonable given the estimates above. We therefore conclude that the second experiment gives us no compelling reason to doubt the correctness of quantum theory.
The arguments presented above also support the notion that our experimental data is consistent with quantum theory according to the usual standards by which one judges this claim: if we had considered fitting the data with quantum states and effects rather their GPT counterparts (which one could accomplish by doing a GPT fit while constraining the vertices of the realized and consistent GPT state spaces to contain a sphere between them, up to linear transformations), we would have found that the quality of the fit was good.

B. Upper and lower bounds on violation of noncontextuality inequalities
One method we use to bound possible deviations from quantum theory is to consider the maximal violation of a particular noncontextuality inequality [49]. From our data we infer a range in which the maximal violation can lie, and compare this to the quantum prediction. We will briefly introduce the notion of noncontextuality, then discuss the inferences we make. The notion of noncontextuality was introduced by Kochen and Specker [64]. We here consider a generalization of the Kochen-Specker notion, termed universal noncontextuality, defined in Ref. [48].
Noncontextuality is a notion that applies to an ontological model of an operational theory. Such a model is an attempt to understand the predictions of the operational theory in terms of a system that acts as a causal mediary between the preparation device and the measurement device. It postulates a space of ontic states Λ, where the ontic state λ ∈ Λ specifies all the physical properties of the physical system according to the model. For each preparation procedure P of a system, it is presumed that the system's ontic state λ is sampled at random from a probability distribution p(λ|P ). For each measurement M on a system, it is presumed that its outcome O is sampled at random in a manner that depends on the ontic state λ, based on the conditional probability p(O|λ, M ). It is presumed that the empirical predictions of the operational theory are reproduced by the ontological model, We can now articulate the assumption of noncontextuality for both the preparations and the measurements. Preparation noncontextuality. If two preparation procedures, P and P ′ , are operationally equivalent, which in the GPT framework corresponds to being represented by the same GPT state vector, then they are represented by the same distribution over ontic states: To assume universal noncontextuality is to assume noncontextuality for all procedures, including preparations and measurements 13 .
There are now many operational inequalities for testing universal noncontextuality. Techniques for deriving such inequalities from proofs of the Kochen-Specker theorem are presented in [65][66][67]. In addition, there exist other proofs of the failure of universal noncontextuality that cannot be derived from the Kochen-Specker theorem. The proofs in Ref. [48] based on prepare-and-measure experiments on a single qubit are an example, and these too can be turned into inequalities testing for universal noncontextuality (as shown in Refs. [37] and [68]).
We here consider the simplest example of a noncontextuality inequality that can be violated by a qubit, namely the one associated to the task of 2-bit parity-oblivious multiplexing (POM), described in Ref. [49]. Bob receives as input from a referee an integer y chosen uniformly at random from {0, 1} and Alice receives a two-bit input string (z 0 , z 1 ) ∈ {0, 1} 2 , chosen uniformly at random. Success in the task corresponds to Bob outputting the bit b = z y , that is, the yth bit of Alice's input. Alice can send a system to Bob encoding information about her input, but no information about the parity of her string, z 0 ⊕ z 1 , can be transmitted to Bob. Thus, if the referee performs any measurement on the system transmitted, he should not be able to infer anything about the parity. The latter constraint is termed parity-obliviousness. 14 13 There is also a notion of noncontextuality for transformations [48], but we will not make use of it here. In fact, the noncontextuality inequality we consider is one that only makes use of the assumption of noncontextuality for preparations. 14 Parity-oblivious multiplexing is akin to a 2-to-1 quantum random access code. It was not introduced as a type of random access code in Ref. [49] because the latter are generally defined as having a constraint on the potential information-carrying capacity of the system transmitted, whereas in parity-oblivious multiplexing, the system can have arbitrary information-carrying capacity-the only constraint is that of parity-obliviousness.
An operational theory describes every protocol for parity-oblivious multiplexing as follows. Based on the input string (z 0 , z 1 ) ∈ {0, 1} 2 that she receives from the referee, Alice implements a preparation procedure P z0z1 , and based on the integer y ∈ {0, 1} that he receives from the referee, Bob implements a binary-outcome measurement M y , and reports the outcome b of his measurement as his output. Given that each of the 8 values of (y, z 0 , z 1 ) are equally likely, the probability of winning, denoted C, is where δ b,zy is the Kronecker delta function. The parity obliviousness condition can be expressed as a constraint on the GPT states, as This asserts the operational equivalence of the parity-0 preparation (the uniform mixture of P 00 and P 11 ) and the parity-1 preparation (the uniform mixture of P 01 and P 10 ), and therefore it implies a nontrivial constraint on the ontological model by the assumption of preparation noncontextuality (Eq. (5)), namely, It was shown in Ref. [49] that if an operational theory admits of a universally noncontextual ontological model, then the maximal value of the probability of success in parity-oblivious multiplexing is We refer to the inequality as the POM noncontextuality inequality. 15 It was also shown in Ref. [49] that in operational quantum theory, the maximal value of the probability of success is which violates the POM noncontextuality inequality, thereby providing a proof of the impossibility of a noncontextual model of quantum theory and demonstrating a quantum-over-noncontextual advantage for the task of parity-oblivious multiplexing. A set of four quantum states and two binary-outcome quantum measurements that satisfy the parity-obliviousness condition of Eq. (8) and that lead to success probability C Q are illustrated in Fig. 9. 
For a given GPT state space S and effect space E, we define where the optimization must be done over choices of {s Pz 0 z 1 } ∈ S that satisfy the parity-obliviousness constraint of Eq. (8). If S and E are the state and effect spaces of a GPT, then s Pz 0 z 1 · e b|My is the probability p(b|P z0z1 , M y ) and C (S,E) has the form of Eq. (7) and defines the maximum probability of success achievable in the task of parity-oblivious multiplexing for that GPT.
(We will see below that it is also useful to consider C (S,E) when the pair S and E do not define the state and effect spaces of a GPT.) As discussed in Section II D, no experiment can specify S and E exactly. Instead, what we find is a set of possibilities for (S, E) that are consistent with the data, and thus are candidates for the true GPT state and effect spaces. We denote this set of candidates by GPT candidates . To determine the range of possible values of the POM noncontextuality inequality violation in this set, we need to determine and See Fig. 8(a) for a schematic of the relation between the various C quantities we consider.
C min and C max are each defined as a solution to an optimization problem. As noted in Sec. II D, there is a large freedom in the choice of S given S realized and S consistent , and there is a large freedom in the choice of E for each choice of S. Finally, for each pair (S, E) in this set, one still needs to optimize over the choice of four preparations and two measurements defining the probability of success.
It turns out that the choice of (S, E) that determines C min is easily identified. First, note that the definition in Eq. (13) implies the following inference Given that S realized ⊆ S and E realized ⊆ E for all (S, E) ∈ GPT candidates , it follows that And given that (S realized , E realized ) is among the GPT candidates consistent with the data, we conclude that However, calculating C (S realized ,E realized ) still requires solving the optimization problem defined in Eq. (13), which is computationally difficult.
Much more tractable is the problem of determining a lower bound on C min , using a simple inner approximation to S realized and E realized . This is the approach we pursue here. We will denote this lower bound by LB(C min ).
Let S w qubit denote the image of the qubit state space S qubit under the partially depolarizing map D w , defined by with w ∈ [0, 1]. Similarly, let E w ′ qubit denote the image of E qubit under D w ′ . We also depict the states and effects that achieve the maximum probability of success in parity-oblivious multiplexing in quantum theory (orange squares), and those that achieve our lower (magenta circles) and upper (yellow triangles) bounds. The left figure depicts the GPT state vectors of the four preparations, labelled by the possible values of the pair of bits Alice must encode, and the right figure depicts the GPT effect vectors of each outcome of each of the pair of measurements.
Consider the 2-parameter family of GPTs defined by {(S w qubit , E w ′ qubit ) : w, w ′ ∈ (0, 1)}. These correspond to quantum theory for a qubit but with noise added to the states and to the effects. Letting w 1 be the largest value of the parameter w such that S w qubit ⊆ S realized and letting w ′ 1 be the largest value of the parameter w ′ such that E w qubit ⊆ E realized , then S w1 qubit and E w ′ 1 qubit provide inner approximations to S realized and E realized respectively, depicted in Fig. 9. From these, we get the lower bound A subtlety that we have avoided mentioning thus far is that the depolarized qubit state and effect spaces are only defined up to a linear transformation, so that in seeking an inner approximation, one could optimize over not only w but linear transformations as well. To simplify the analysis, however, we took S w qubit to be a sphere of radius w and E w ′ qubit to be a diamond with a base that is a sphere of radius w ′ , and we optimized over w and w ′ . (Optimizing over all linear transformations would simply give us a tighter lower bound.) For the GPT (S w qubit , E w ′ qubit ), a set of four preparations and two binary-outcome measurements that satisfy the parity-obliviousness condition of Eq. (8) and that yield the maximum probability of success are the images, under the partially depolarizing maps D w and D w ′ respec-tively, of the optimal quantum choices. These images are depicted in Fig. 9.
For this GPT, one finds that the probability of success in parity-oblivious multiplexing is the quantum value with probability ww ′ , and 1/2 the rest of the time, From our estimates of the realized GPT state and effect spaces,S realized andẼ realized , we obtain an estimate of w 1 by identifying the largest value of w such that S w qubit ⊆ S realized and we obtain an estimate of w ′ 1 by identifying the largest value of w ′ such that E w ′ qubit ⊆Ẽ realized . Determining these estimates from the data of the first experiment, then substituting into Eq. (21) and using Eq. (20), we infer the lower bound LB(C min ) = 0.8303 ± 0.0002. A similar analysis for the second experiment yields an even tighter bound, This provides a lower bound on the interval of C values in which the true value could be found, as depicted in Fig. 8(b). 16 We now turn to C max . Given that for all (S, E) ∈ GPT candidates , S ⊆ S consistent and E ⊆ E consistent , it follows from Eq. (16) that C max ≤ C (Sconsistent,Econsistent) . 17 We can therefore compute an upper bound on C max using outer approximations to S consistent and E consistent . We choose outer approximations consisting of rescaled qubit state and effect spaces, defined as before, but where the parameter w can now fall outside the interval [0, 1].
Letting w 2 be the smallest value of the parameter w such that S consistent ⊆ S w qubit and letting w ′ 2 be the smallest value of the parameter w ′ such that E consistent ⊆ E w ′ qubit , then S w2 qubit and E w ′ 2 qubit provide outer approximations to S consistent and E consistent respectively, and so we get an upper bound 16 Note that it is likely that this lower bound could be improved if one supplemented the preparations and measurements that were implemented in the experiment with a set that were targeted towards achieving the largest value of C (according to quantum expectations). 17 At this point, the analogy to the case of C min might lead one to expect that Cmax = C (S consistent ,E consistent ) . However, this is incorrect because the pair (S consistent , E consistent ) is not among the GPT candidates consistent with the experimental data. In fact, it does not even correspond to a valid GPT, as one can find a GPT state vector in S consistent and a GPT effect vector in E consistent with inner product outside the interval [0, 1], hence not defining a probability. Unfortunately, if one wants to calculate Cmax, it seems that one must perform the difficult optimization in Eq. (15).
Even though we are now allowing supernormalized state and effect vectors, via w and w ′ values outside of [0, 1], a simple calculation shows that C (S w qubit ,E w ′ qubit ) is still given by Eq. (21).
Our estimatesS consistent andẼ consistent for the state and effect spaces of the first experiment imply estimates for w 2 and w ′ 2 18 and substituting these into Eqs. (23) and (21), we infer UB(C max ) = 0.8784 ± 0.0002. The same analysis on the second experiment yields UB(C max ) = 0.8647 ± 0.0005.
This provides an upper bound on the interval of C values in which the true value could be found, as depicted in Fig. 8(b).
Recalling that the quantum value is C Q ≃ 0.8536, it follows from Eqs. (22) and (24) that the scope for the true GPT to differ from quantum theory in the amount of contextuality it predicts (relative to the POM inequality) is quite limited: for the true GPT, the maximum violation of the POM noncontextuality inequality can be at most 1.3% ± 0.1 less than and at most 1.3% ± 0.1 greater than the quantum value.

C. Upper bound on violation of Bell inequalities
Bell's theorem famously shows that a certain set of assumptions, which includes local causality, is in contradiction with the predictions of operational quantum theory [69]. It is also possible to derive inequalities from these assumptions that refer only to operational quantities and thus can be tested directly in experiments.
The Clauser-Horne-Shimony-Holt (CHSH) inequality [47] is the standard example. A pair of systems are prepared together according to a preparation procedure $P_{AB}$; one is then sent to Alice and the other to Bob. At each wing of the experiment, the system is subjected to one of two binary-outcome measurements, $M^A_0$ or $M^A_1$ on Alice's side and $M^B_0$ or $M^B_1$ on Bob's side, with the choice of measurement made uniformly at random, and with the choice at one wing space-like separated from the registration of the outcome at the other wing. Denoting the binary variable determining the measurement choice at Alice's (Bob's) wing by $x$ ($y$), and the outcome of Alice's (Bob's) measurement by $a$ ($b$), the operational quantity of interest, the "Bell quantity" for CHSH, is defined as follows (where $a, b, x, y \in \{0, 1\}$, and $\oplus$ is addition modulo 2):

$$B \equiv \frac{1}{4} \sum_{a,b,x,y} \delta_{a \oplus b,\, xy}\; p(a, b \,|\, M^A_x, M^B_y, P_{AB}). \qquad (25)$$

[Footnote 18: We note that the duality relation $E_{\rm consistent} = \mathrm{dual}(S_{\rm realized})$ implies that $E^{w'_2}_{\rm qubit} = \mathrm{dual}(S^{w_1}_{\rm qubit})$, and similarly the relation $S_{\rm consistent} = \mathrm{dual}(E_{\rm realized})$ implies $S^{w_2}_{\rm qubit} = \mathrm{dual}(E^{w'_1}_{\rm qubit})$. This in turn implies that $w'_2 = 1/w_1$ (and likewise $w_2 = 1/w'_1$), and hence $\mathrm{UB}(C_{\max}) = \frac{1}{2} + \frac{1}{2\sqrt{2}\, w_1 w'_1}$.]
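The Bell quantity of Eq. (25) can be evaluated for standard examples (a sketch; the probability tables below are textbook strategies, not experimental data):

```python
import itertools

def bell_quantity(p):
    """Eq. (25): B = (1/4) * sum over a,b,x,y with a XOR b = x*y of p(a,b|x,y).
    Here p[(x, y)][(a, b)] is the conditional distribution for settings (x, y)."""
    return 0.25 * sum(p[(x, y)][(a, b)]
                      for a, b, x, y in itertools.product((0, 1), repeat=4)
                      if a ^ b == x * y)

# PR box: always satisfies a XOR b = x*y, so B = 1
pr_box = {(x, y): {(a, b): 0.5 if a ^ b == x * y else 0.0
                   for a in (0, 1) for b in (0, 1)}
          for x in (0, 1) for y in (0, 1)}

# a deterministic local strategy (always output a = b = 0) reaches B = 3/4
local = {(x, y): {(a, b): 1.0 if (a, b) == (0, 0) else 0.0
                  for a in (0, 1) for b in (0, 1)}
         for x in (0, 1) for y in (0, 1)}
```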
The maximum value that this quantity can take in a model satisfying local causality and the other assumptions of Bell's theorem is

$$B_{\rm loc} = \frac{3}{4}, \qquad (26)$$

so that such models satisfy the CHSH inequality

$$B \le B_{\rm loc}. \qquad (27)$$

Meanwhile, the maximum quantum value is [70]

$$B_Q = \frac{1}{2} + \frac{1}{2\sqrt{2}} \simeq 0.8536. \qquad (28)$$

Experimental tests have exhibited a violation of the CHSH inequality [71], and various loopholes for escaping this conclusion have been sealed experimentally [72-77]. These experiments provide a lower bound on the value of the Bell quantity, one which violates the local bound.
It has not been previously clear, however, how to derive an upper bound on the Bell quantity. Doing so is necessary if one hopes to experimentally rule out post-quantum correlations such as the Popescu-Rohrlich box [52,70]. We here demonstrate how to do so.
First note that the probability for obtaining outcomes $a$ and $b$ given settings $x$ and $y$, which appears in Eq. (25), can be expressed in the GPT framework as

$$p(a, b \,|\, M^A_x, M^B_y, P_{AB}) = s_{P_{AB}} \cdot \big(e_{a|M^A_x} \otimes e_{b|M^B_y}\big), \qquad (29)$$

where $s_{P_{AB}}$ is the GPT state on the composite system $AB$ representing the preparation $P_{AB}$ (it is said to be entangled if it cannot be written as a convex mixture of states that factorize on the vector spaces of the components [29]), and where $e_{a|M^A_x}$ ($e_{b|M^B_y}$) is the GPT effect on $A$ ($B$) representing the outcome $a$ ($b$) of measurement $M^A_x$ ($M^B_y$).

Learning that the $M^A_x$ measurement was implemented on the preparation $P_{AB}$ and yielded the outcome $a$ can be conceived of as a preparation for system $B$, which we denote by $P^B_{a|x}$. The GPT state representing this remote preparation, which we denote by $s_{P^B_{a|x}}$, is defined by

$$s_{P^B_{a|x}} \equiv \frac{1}{p_{a|x}}\, \big(e_{a|M^A_x} \otimes I_B\big)\, s_{P_{AB}}, \qquad (30)$$

where we introduce the shorthand $p_{a|x} \equiv p(a \,|\, M^A_x, P_{AB})$, and where $I_B$ represents the identity operator on system $B$. Given this definition, one can re-express the probability appearing in the Bell quantity as

$$p(a, b \,|\, M^A_x, M^B_y, P_{AB}) = p_{a|x}\; s_{P^B_{a|x}} \cdot e_{b|M^B_y}, \qquad (31)$$

which involves only GPT states and GPT effects on system $B$. In this case, one is conceptualizing the Bell experiment as achieving one of a set of remote preparations of the state of Bob's system (commonly referred to as "steering"), followed by a measurement on Bob's system.
The assumption of space-like separation implies that there is no signalling between Alice and Bob, and this constrains how Bob's system can be steered. Since $p_{a|x}$ is the probability that Alice obtains outcome $a$ given that she performs measurement $M^A_x$ on the preparation $P_{AB}$, the marginal GPT state of Bob's subsystem when one does not condition on $a$ is given by $\sum_a p_{a|x}\, s_{P^B_{a|x}}$. The no-signalling assumption forces this marginal state to be independent of Alice's measurement choice $x$. In the CHSH scenario, the no-signalling constraint is summarized by the following equation:

$$\sum_a p_{a|0}\, s_{P^B_{a|0}} = \sum_a p_{a|1}\, s_{P^B_{a|1}}. \qquad (32)$$

Because we are assuming that the true GPT includes classical probability theory as a subtheory (see Sec. II A), it follows that the local value, $B_{\rm loc}$, is a lower limit on the range of possible values of the Bell quantity among experimentally viable candidates for the true GPT. This is a trivial lower limit. In order to obtain a nontrivial lower limit on this range (i.e., one greater than $B_{\rm loc}$), one would need to perform an experiment involving two physical systems, so that one could learn which GPT states of the bipartite system are physically realizable (in particular, whether any entangled states are realized) and thus which steering schemes are physically realizable. Because our experiment is on a single physical system, it cannot attest to the physical realizability of any bipartite states and hence cannot attest to the physical realizability of any particular instance of steering.
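As a quantum illustration of the steering decomposition and the no-signalling constraint (a sketch at the quantum level, not in the full GPT generality of the text):

```python
import numpy as np

# Steered ensembles from a maximally entangled state: the unnormalized steered
# state p_{a|x} * s_{P^B_{a|x}} is the partial trace Tr_A[(E_{a|x} x I) rho].
I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)  # |Phi+>
rho = np.outer(phi, phi)

def proj(obs, a):
    """Projector onto the (-1)^a eigenspace of a Pauli observable."""
    return (I2 + (-1) ** a * obs) / 2

def steered(effect):
    """Unnormalized steered state of Bob given Alice's effect."""
    m = np.kron(effect, I2) @ rho
    return m.reshape(2, 2, 2, 2).trace(axis1=0, axis2=2)  # partial trace over A

# Alice measures Z (x = 0) or X (x = 1); sum the steered ensemble over a
marginal = [steered(proj(obs, 0)) + steered(proj(obs, 1)) for obs in (Z, X)]
# no-signalling: the marginal is independent of x (here both equal I/2)
```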
Nonetheless, our experiment can attest to the logical impossibility of particular instances of steering, namely, any instance of steering wherein the ensemble on Bob's system contains one or more GPT states outside of $S_{\rm consistent}$, because such states by definition assign values outside $[0, 1]$ (which cannot be interpreted as probabilities) to some physically realized GPT effects (i.e., some GPT effects in $E_{\rm realized}$). This in turn implies the nonexistence of any bipartite GPT state (together with a GPT measurement on Alice's system) that could be used to realize such an instance of steering, even though the experiment probes only a single system rather than a pair. Therefore, we can use our experimental results to determine an upper limit on the range of values of the Bell quantity among experimentally viable candidates for the true GPT.
The maximum violation of the CHSH inequality achievable when Bob's system is described by a state space $S$ and an effect space $E$ is

$$B(S, E) \equiv \max \frac{1}{4} \sum_{a,b,x,y} \delta_{a \oplus b,\, xy}\; p_{a|x}\; s_{P^B_{a|x}} \cdot e_{b|M^B_y}, \qquad (33)$$

where the maximization is over the effects $\{e_{b|M^B_y}\}$ in $E$ and over the ensembles $\{(p_{a|x}, s_{P^B_{a|x}})\}$, with each $s_{P^B_{a|x}} \in S$, that satisfy the no-signalling constraint, Eq. (32). If the pair $S$ and $E$ together form a valid GPT, then $p_{a|x}\, s_{P^B_{a|x}} \cdot e_{b|M^B_y}$ is a probability and we recover Eq. (25).
The upper limit on the range of possible values of the CHSH inequality violation among the theories in $\mathrm{GPT}_{\rm candidates}$, which we denote by $B_{\max}$, is defined analogously to $C_{\max}$ in Eq. (15).
Calculating $B_{\max}$ is a difficult optimization problem that involves varying over every pair $(S, E)$ consistent with the experiment and, for each pair, performing the optimization in Eq. (33).
Instead of performing this difficult optimization, we will derive an upper bound on $B_{\max}$, denoted $\mathrm{UB}(B_{\max})$. This is achieved in the same manner as the upper bound on $C_{\max}$ in the previous section, namely, using a qubit-like outer approximation.
For qubit-like state and effect spaces, it turns out that the maximum violation of the CHSH inequality is the greater of $\frac{3}{4}$ and the value given for the probability of success in POM in Eq. (21). The proof is provided in Appendix F.
This provides an upper bound on the interval of $B$ values in which the true value of the maximal CHSH inequality violation lies, as depicted in Fig. 8(c). As noted earlier, our experiment provides only the trivial lower bound $\mathrm{LB}(B_{\min}) = B_{\rm loc}$. Nontrivial lower bounds have, of course, been provided in previous Bell experiments using photon polarization, such as Ref. [78].

V. DISCUSSION
We have described a scheme for constraining what GPTs can model a degree of freedom on which one has statistical data from a prepare-and-measure experiment. It proceeds by a tomographic characterization of the GPT states and effects that best represent the preparations and measurements realized in the experiment. By computing the duals of these, one constrains the possibilities for the true GPT state and effect spaces. The tomographic scheme is self-consistent in the sense that it does not require any prior characterization of the preparations and measurements.
The rank of the GPT describing the preparations and measurements realized in our experiment can be determined with very high confidence by our method. Because the models we consider have $k(m + n - k)$ parameters, where $k$ is the rank of the model, $m$ is the number of preparations, and $n$ is the number of measurements, increasing the rank of the model by 1 increases the parameter count by hundreds in the first experiment and by thousands in the second. For this reason, the Akaike information criterion can deliver a decisive verdict against models of rank higher than the smallest rank that yields a respectable $\chi^2$, on the grounds that such higher-rank models grossly overfit the data.
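The parameter-count argument can be made concrete (a sketch; $m = n = 100$ and $m = n = 1000$ are taken from the stated sizes of the two experiments, and the counting of measurement columns is simplified):

```python
def n_params(k, m, n):
    """Parameter count k(m + n - k) of a rank-k GPT model for m preparations
    and n measurements, as stated in the text."""
    return k * (m + n - k)

# growth in parameter count when the rank is increased from 4 to 5
growth_first = n_params(5, 100, 100) - n_params(4, 100, 100)
growth_second = n_params(5, 1000, 1000) - n_params(4, 1000, 1000)
# hundreds of extra parameters in the first experiment, thousands in the second
```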
Our experimental results are consistent with the conclusion that in prepare-and-measure experiments, photon polarization acts like a 2-level quantum system, corresponding to a GPT vector space of dimension 4.
As emphasized in the introduction and Sec. III A, however, any hypothesis concerning the tomographic completeness of a given set of preparations or measurements is necessarily tentative. Our experiment provided an opportunity for discovering that the cardinality of a tomographically complete set of preparations (measurements) for photon polarization (or equivalently the dimension of the GPT describing them) deviated from our quantum expectations, but it found no evidence of such a dimensional deviation.
Under the assumption that the set of preparations and measurements we realized were tomographically complete, the technique we have described provides a means of obtaining experimental bounds on how the shapes of the state and effect spaces might deviate from those stipulated by quantum theory. We focused in this article on three examples of such deviations, namely, the failure of the no-restriction hypothesis, supra-quantum violations of Bell inequalities, and supra-quantum or sub-quantum violations of noncontextuality inequalities.
Modifications of quantum theory that posit intrinsic decoherence imply unavoidable noise and thereby a failure of the no-restriction hypothesis. We focused on the volume ratio of $S_{\rm logical}$ to $S$ as a generic measure of the failure of the no-restriction hypothesis, and we obtained an upper bound on that measure via the volume ratio of $S_{\rm consistent}$ to $S_{\rm realized}$. This provides an upper bound on the degree of noise in any intrinsic decoherence mechanism.
If one makes more explicit assumptions about the decoherence mechanism, one can be more explicit about the bound. Suppose that the noise arising from intrinsic decoherence in a prepare-and-measure experiment on photon polarization corresponds to a partially depolarizing map $D_{1-\epsilon}$ (Eq. (19)), where $\epsilon$ is a small parameter describing the strength of the noise. Then GPT tomography would find $S_{\rm realized} \subseteq S^{v}_{\rm qubit}$ and $E_{\rm realized} \subseteq E^{v'}_{\rm qubit}$ where $v v' = 1 - \epsilon$. The best qubit-like inner approximations to $S_{\rm realized}$ and $E_{\rm realized}$, denoted $S^{w_1}_{\rm qubit}$ and $E^{w'_1}_{\rm qubit}$ in our article, define a lower bound on $v v'$, namely $w_1 w'_1 \le v v'$, and thereby an upper bound on $\epsilon$, namely $\epsilon \le 1 - w_1 w'_1$. From our second experiment, we obtained the estimate $w_1 w'_1 = 0.969 \pm 0.001$, which implies that $\epsilon \le 0.031 \pm 0.001$. We have also provided experimental bounds on the amount by which the system we studied could yield Bell and noncontextuality inequality violations in excess of their maximum quantum values.
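A minimal arithmetic sketch of this bound, with the depolarizing map written in Bloch form (the numerical input is the second-experiment estimate quoted above):

```python
def depolarize(bloch, eps):
    """Partially depolarizing map D_{1-eps} on a Bloch vector: shrink it
    toward the maximally mixed state by a factor 1 - eps."""
    return [(1 - eps) * c for c in bloch]

w1_w1p = 0.969            # estimate of w_1 * w'_1 from the second experiment
eps_bound = 1 - w1_w1p    # epsilon <= 1 - w_1 w'_1
```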
Because violation of each of the inequalities we have considered is related to an advantage for some information-processing task (specifically, parity-oblivious multiplexing and the CHSH game), our experimental upper bounds on these violations imply upper bounds on the possible advantage for these tasks. More generally, our techniques can be used to derive limits on the advantage for any task that is powered by nonlocality or contextuality.
Our results also exclude deviations from quantum theory that have some theoretical motivation. For instance, Brassard et al. [79] have shown that communication complexity becomes trivial if one has CHSH inequality violations of $\frac{1}{2} + \frac{1}{\sqrt{6}} \simeq 0.908$ or higher. If one assumes that this is the actual threshold at which communication complexity becomes nontrivial (as opposed to being a nonstrict upper bound), and if one endorses the nontriviality of communication complexity as a principle that the true theory of the world ought to satisfy, then one has reason to speculate that the true theory of the world might achieve a CHSH inequality violation somewhere between the quantum bound of 0.8536 and 0.908. Our experimental bound, however, rules out most of this range of values.
Our experiment also provides a test (and exclusion) of the hypothesis of universal noncontextuality. In this capacity, it represents a significant improvement over the best previous experiment [37], especially vis-a-vis what was identified in Ref. [37] as the greatest weakness of that experiment, namely, the extent of the evidence for the claim that a given set of measurements or preparations should be considered tomographically complete. Recall that every assessment of operational equivalence between two preparations (measurements), from which one deduces the nontrivial consequences of universal noncontextuality, rests upon the assumption that one has compared their statistics on a tomographically complete set of measurements (preparations).
The experiment reported in Ref. [37] implemented eight distinct effects and eight distinct states on single-photon polarization, and consequently it had the opportunity to discover that a GPT of dimension 4 did not provide a good fit to the data. In other words, the experiment reported in Ref. [37], just like the experiment reported here, had the opportunity to discover that the cardinality of the tomographically complete sets of effects and states for photon polarization (and hence the dimension of the GPT) was not what quantum theory would lead one to expect, via the sort of precision strategy for detecting dimensional deviations described in the introduction and in Section III A. Consequently, it had an opportunity to discover that quantum expectations regarding operational equivalences were also violated.
The experimental test of noncontextuality reported in the present article, however, improves on that of Ref. [37] insofar as it provided a much better opportunity for detecting dimensional deviations from quantum theory, and hence a much better opportunity for uncovering violations of our quantum expectations regarding which sets of preparations and measurements are tomographically complete, the grounds for all assessments of operational equivalences. In particular, instead of probing just eight states and effects, we probed one hundred of each in the first experiment and one thousand of each in the second, and we explicitly explored the possibility that GPT models with rank greater than 4 might provide a better fit to the data. Specifically, we used the Akaike criterion, which incorporates not only the quality of fit of a model ($\chi^2$) but also the number of parameters it requires to achieve this fit, to determine which rank of model is most likely given the data.
It is important to recall that our experiment probed only a single type of system: the polarization degree of freedom of a photon. A question that naturally arises at this point is: to what extent can our conclusions be ported to other types of systems?
Consider first the question of portability to other types of two-level systems (by which we mean systems that are described quantumly by a two-dimensional Hilbert space). If different two-level systems could be governed by different GPTs, this would immediately lead to a thorny problem: how to ensure that the different restrictions on their behaviours were respected even in the presence of interactions between them. Indeed, the principle that every $n$-level system has the same GPT state and effect spaces as every other has featured in many reconstructions of quantum theory within the GPT framework (see, e.g., the subspace axiom in Ref. [8], and its derivation from other axioms in Ref. [80]) and is taken to be a very natural assumption. This suggests that there are good theoretical grounds for thinking that our experimental constraints on possible deviations from quantum theory are applicable to all types of two-level systems.
It is less clear what conclusions one might draw for $n$-level systems with $n \neq 2$. For instance, although quantumly the maximum violation of a CHSH inequality is the same regardless of whether Bob's system is a qubit or a qutrit, this might not be the case for some nonquantum GPT. Therefore, although there are theoretical reasons for believing that our upper bound on the degree of CHSH inequality violation (assuming no dimensional deviation) applies to all two-level systems, we cannot apply those reasons to argue that violations will be bounded in this way for $n$-level systems. Nonetheless, if one does assume that all two-level systems are described by the same GPT, then we have constraints on the state and effect spaces of every two-level system that is embedded (as a subspace) within the $n$-level system. This presumably restricts the possibilities for the state and effect spaces of the $n$-level system itself. How to infer such restrictions (for instance, how to infer an upper bound on the maximal CHSH inequality violation for a three-level system from one on a two-level system) is an interesting problem for future research.
There is evidently a great deal of scope for further experiments of the type described here. An obvious direction for future work is to apply our techniques to the characterization of higher-dimensional systems and composites. Another interesting extension would be to generalize the technique to include GPT tomography of transformations, in addition to preparations and measurements. This is the GPT analogue of quantum process tomography, on which there has been a great deal of work due to its application in benchmarking experimental implementations of gates for quantum computation. It is likely that many ideas in this sphere can be ported to the GPT context. A particularly interesting case to consider is the scheme known as gate set tomography [81-83], which achieves a high-precision characterization of a set of quantum gates in a self-consistent manner.

Source

The 20 mm long PPKTP crystal is pumped with 0.29 mW of continuous-wave laser light at 404.7 nm, producing pairs of 809.4 nm photons with orthogonal polarizations. We detect approximately 22% of the herald photons produced, and approximately 9% of the signal photons produced. In order to characterize the single-photon nature of the source we performed a $g^{(2)}(0)$ measurement [84] and found $g^{(2)}(0) = 0.00184 \pm 0.00003$. This low $g^{(2)}(0)$ value implies that the ratio of double pairs to single pairs produced by the source is $\sim 1 : 2000$. We found that if we increased the pump power, a rank-4 model no longer fit the data well. This is because the two-photon state space has a higher dimension than the one-photon state space. The avalanche-photodiode single-photon detectors we use respond nonlinearly to the number of incoming photons [85]; this makes our measurements sensitive to the multi-pair component of the downconverted light and ultimately limits the maximum power we can set for the pump laser.

Measurements
After a photon exits the measurement PBS, the probability that it will be detected depends on which port of the PBS it exited from. This is because the efficiencies of the two paths from the measurement PBS to the detector are not exactly equal, and also because the detectors themselves do not have the same efficiency. To average out the two different efficiencies we perform each measurement in two stages.
We will use language from quantum mechanics to explain our procedure. Say we want to perform a projective measurement in the $\{|\psi\rangle, |\psi^\perp\rangle\}$ basis, for some polarization $|\psi\rangle$ and its orthogonal partner $|\psi^\perp\rangle$. We first rotate our measurement waveplates so they rotate $|\psi\rangle$ to the horizontal polarization $|H\rangle$ (and thus $|\psi^\perp\rangle$ is rotated to the vertically polarized state $|V\rangle$). In each output port, we record the number of photons detected in coincidence with the herald over an integration time of four seconds. We label detections in the transmitted port with '0' and detections in the reflected port with '1'. Second, we rotate the measurement waveplates such that $|\psi\rangle \to |V\rangle$ and $|\psi^\perp\rangle \to |H\rangle$. We then swap the labels on the measurement outcomes, such that the reflected port corresponds to outcome '0' and the transmitted port to '1'. We again record the number of coincidences between each output port and the herald for four seconds. Finally, we sum the total number of '0' detections, and likewise the total number of '1' detections, over the total eight-second measurement time. The measured frequency of outcome '0' is then the total number of '0' detections divided by the sum of the total numbers of '0' and '1' detections.
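A toy numerical model of this two-stage averaging, with hypothetical efficiency values, shows that the path/detector asymmetry cancels exactly when the counts from the two waveplate settings are summed before the frequency is computed:

```python
# hypothetical values for illustration only
p_true = 0.3                   # true outcome-'0' probability
n_pairs = 10 ** 6              # heralded photons per stage (idealized)
eta_t, eta_r = 0.80, 0.60      # transmitted / reflected path efficiencies

# stage 1: outcome '0' exits the transmitted port
c0_stage1 = eta_t * p_true * n_pairs
c1_stage1 = eta_r * (1 - p_true) * n_pairs
# stage 2: waveplates rotated, outcome labels swapped
c0_stage2 = eta_r * p_true * n_pairs
c1_stage2 = eta_t * (1 - p_true) * n_pairs

f0 = (c0_stage1 + c0_stage2) / (c0_stage1 + c0_stage2 + c1_stage1 + c1_stage2)
# f0 equals p_true: the common factor (eta_t + eta_r) cancels
```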

a. Threefold coincidences
Sometimes, all three detectors in the experiment fire within a single coincidence window. These events are most likely caused either by a multi-pair emission from the source, or by the successful detection of both photons in a single pair in conjunction with a background count at the third detector. We choose to interpret each threefold coincidence as a pair of pairwise coincidences: one between the herald and transmitted-port detectors, and one between the herald and reflected-port detectors.
Since we are only interested in characterizing the single-pair emissions from our source (and not multi-pair ones), we could instead have chosen to discard all threefold-coincidence events completely. We note that if we had done this, the raw frequency data to which we fit our GPT would change, on average, by an amount that is only 0.01% of the statistical uncertainty on these frequencies. Using the Akaike information criterion, we would still have concluded that the GPT most likely to describe the data is rank 4. Finally, the probabilities in the rank-4 GPT of best fit would be essentially unchanged, and the shapes of the reconstructed GPT state and effect spaces (and therefore also the inferences made about the achievable inequality violations) would not be affected in any significant way.

[Figure caption: (a) Red dots represent the six fiducial states used to characterize the 1000 measurements in Fig. 10(b). These correspond to the $+1$ and $-1$ eigenstates of the three Pauli operators $\sigma_x$, $\sigma_y$, and $\sigma_z$. (b) Red dots represent the six fiducial measurement effects used to characterize each of the states in Fig. 10(b). These effects lie on six of the twelve vertices of an icosahedron, and they correspond to the outcome-'0' effect of a projective measurement. Each outcome-'0' effect has a corresponding outcome-'1' effect; each outcome-'1' effect is represented by one of the other six vertices of the icosahedron.]
For an $m \times n$ matrix of frequency data, $F$, we define the rank-$k$ matrix of best fit, $\tilde D$, as the one that minimizes the weighted $\chi^2$ value,

$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\big(F_{ij} - \tilde D_{ij}\big)^2}{(\Delta F_{ij})^2},$$

where the weights $\Delta F_{ij}$ are the uncertainties in the measured frequencies, which are calculated assuming Poissonian error in the counts (in cases where we did not collect data for the preparation-measurement pair corresponding to entry $F_{ij}$, we set $\Delta F_{ij} = \infty$). Since $\tilde D$ represents an estimate of the true probabilities underlying the noisy frequency data, we need to ensure that $\tilde D$ only contains entries between 0 and 1. Hence the matrix of best fit is the one which solves the following minimization problem:

$$\tilde D = \mathop{\rm argmin}_{\substack{D \in M_{mn},\ {\rm rank}(D) = k \\ 0 \le D_{ij} \le 1}} \ \sum_{i,j} \frac{(F_{ij} - D_{ij})^2}{(\Delta F_{ij})^2},$$

where $M_{mn}$ is the space of all $m \times n$ real matrices. The entries in the column of ones (representing the unit measurement effect) that we include in $F$ are exact, meaning that they have an uncertainty of 0. As $\tilde D$ is defined as the matrix that minimizes $\chi^2$, this enforces that the entries in the same column of $\tilde D$ will also remain exactly 1.
To enforce the rank constraint, we use the parameterization $\tilde D = \tilde S \tilde E$, where $\tilde S$ has size $m \times k$ and $\tilde E$ has size $k \times n$. This minimization problem as stated is NP-hard [55] and cannot be solved analytically. However, if either $\tilde S$ or $\tilde E$ is held fixed, optimizing the other variable is a convex problem which can be solved with quadratic programming. We minimize $\chi^2$ by performing a series of alternating optimizations over $\tilde S$ and $\tilde E$ [56].
Each iteration begins with an estimate for $\tilde E$; we then vary over the $m \times k$ matrix $\tilde S$ such that the $m \times n$ matrix $\tilde D = \tilde S \tilde E$ minimizes the $\chi^2$. Next, we fix $\tilde S$ to be the matrix that achieved the minimum in this variation, and we vary over the $k \times n$ matrix $\tilde E$ such that $\tilde D = \tilde S \tilde E$ minimizes the $\chi^2$. This is the end of one iteration, and the matrix $\tilde E$ that achieved the minimum becomes the $\tilde E$ for the beginning of the next iteration. The algorithm runs until a convergence threshold is met (i.e., until $\Delta \chi^2 < 10^{-6}$ between successive iterations) or until a maximum number of iterations (we choose 5000) is reached.
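A minimal sketch of this alternating scheme (unconstrained weighted least squares stands in for the quadratic program that enforces $0 \le \tilde D_{ij} \le 1$; all names and tolerances are illustrative):

```python
import numpy as np

def als_lowrank(F, dF, k, iters=500, tol=1e-12, seed=0):
    """Minimize chi^2 = sum_ij ((F_ij - (S @ E)_ij) / dF_ij)^2 over S (m x k)
    and E (k x n) by alternating weighted least squares. This sketches only
    the alternation; the real fit additionally enforces 0 <= (S @ E)_ij <= 1
    via quadratic programming."""
    rng = np.random.default_rng(seed)
    m, n = F.shape
    sqrt_w = 1.0 / dF                     # square roots of the chi^2 weights
    E = rng.standard_normal((k, n))
    S = np.zeros((m, k))
    chi2_prev = np.inf
    for _ in range(iters):
        for i in range(m):                # update S one row at a time
            A = E.T * sqrt_w[i][:, None]
            S[i] = np.linalg.lstsq(A, F[i] * sqrt_w[i], rcond=None)[0]
        for j in range(n):                # update E one column at a time
            A = S * sqrt_w[:, j][:, None]
            E[:, j] = np.linalg.lstsq(A, F[:, j] * sqrt_w[:, j], rcond=None)[0]
        chi2 = float(np.sum(((F - S @ E) / dF) ** 2))
        if chi2_prev - chi2 < tol:        # convergence threshold
            break
        chi2_prev = chi2
    return S, E, chi2
```

On exactly rank-$k$ data with unit weights, the alternation recovers an essentially exact factorization.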
We will now show that the optimization over $\tilde S$ or $\tilde E$ is convex (given that the other variable is held fixed). For what follows, we will make use of the ${\rm vec}(\cdot)$ operator, which takes a matrix and reorganises its entries into a column vector with the same number of entries as the original matrix. For example, given an $m \times n$ matrix $A$, ${\rm vec}(A)$ is a vector of length $mn$; the first $m$ entries of ${\rm vec}(A)$ are equal to the first column of $A$, entries $m+1$ through $2m$ are equal to the second column of $A$, and so on. We also define a diagonal $mn \times mn$ matrix of weights, $W$, to encode the uncertainties $(1/\Delta F_{ij})^2$. These values appear along the diagonal of $W$, appropriately ordered such that we can rewrite $\chi^2$ in the more convenient form

$$\chi^2 = \big({\rm vec}(F) - {\rm vec}(\tilde S \tilde E)\big)^T\, W\, \big({\rm vec}(F) - {\rm vec}(\tilde S \tilde E)\big),$$

where we have also made the substitution $\tilde D = \tilde S \tilde E$. Defining $I_m$ as the $m \times m$ identity matrix, we can use the identity ${\rm vec}(\tilde S \tilde E) = (\tilde E^T \otimes I_m)\, {\rm vec}(\tilde S)$ to write

$$\chi^2 = \big({\rm vec}(F) - (\tilde E^T \otimes I_m)\, {\rm vec}(\tilde S)\big)^T\, W\, \big({\rm vec}(F) - (\tilde E^T \otimes I_m)\, {\rm vec}(\tilde S)\big),$$

and we now see that, for fixed $\tilde E$, the minimization over ${\rm vec}(\tilde S)$ is a quadratic objective with positive semidefinite Hessian $(\tilde E^T \otimes I_m)^T W (\tilde E^T \otimes I_m)$, subject to the linear constraints $0 \le (\tilde E^T \otimes I_m)\, {\rm vec}(\tilde S) \le 1$; this is a convex quadratic program. The minimization over $\tilde E$ for fixed $\tilde S$ is analogous.

Appendix D: Decomposing $\tilde D_{\rm realized}$ into $\tilde S_{\rm realized}$ and $\tilde E_{\rm realized}$

As discussed in Section III E in the main paper, we find a decomposition $\tilde D_{\rm realized} = \tilde S_{\rm realized} \tilde E_{\rm realized}$ in order to characterize the estimates of the spaces realized by the experiment, $\tilde S_{\rm realized}$ and $\tilde E_{\rm realized}$. Here, $\tilde D_{\rm realized}$ has size $m \times n$, $\tilde S_{\rm realized}$ is $m \times k$, and $\tilde E_{\rm realized}$ is $k \times n$. In this appendix we describe the method we use to perform this decomposition.
We choose the decomposition so that the first column of $\tilde S_{\rm realized}$ is a column of ones, which allows us to represent $\tilde S_{\rm realized}$ in $k-1$ dimensions. (In our experiment we found $k = 4$, but we will use the symbol $k$ in this appendix for generality.) We achieve this by ensuring that the leftmost column of $\tilde D_{\rm realized}$ is a column of ones representing the unit measurement, so that $\tilde D_{\rm realized}$ takes the form

$$\tilde D_{\rm realized} = \begin{pmatrix} 1 & \cdots \\ \vdots & \\ 1 & \cdots \end{pmatrix}. \qquad ({\rm D}1)$$

We then perform the QR decomposition [88] $\tilde D_{\rm realized} = QR$, where $R$ is an $m \times n$ upper triangular matrix and $Q$ is an $m \times m$ orthogonal matrix. Because $\tilde D_{\rm realized}$ has the form of Eq. (D1), each entry in the first column of $Q$ will be equal to some constant $c$. We define $Q' = Q/c$ and $R' = cR$, which ensures that the first column of $Q'$ is a column of ones.
Next, we partition $Q'$ and $R'$ as $Q' = \begin{pmatrix} Q_0 & Q_1 \end{pmatrix}$ and $R' = \begin{pmatrix} R_0 \\ R_1 \end{pmatrix}$, where $Q_0$ is the first column of $Q'$, $Q_1$ comprises the remaining columns of $Q'$, $R_0$ is the first row of $R'$, and $R_1$ comprises the remaining rows of $R'$. We take the singular value decomposition $Q_1 R_1 = U \Sigma V^T$. Since $Q_1 R_1$ has rank $k-1$, it has only $k-1$ nonzero singular values. Hence we can partition $U$, $\Sigma$, and $V$ as $U = \begin{pmatrix} U_{k-1} & U_{(k-1)\perp} \end{pmatrix}$, $\Sigma = \begin{pmatrix} \Sigma_{k-1} & 0 \\ 0 & 0 \end{pmatrix}$, and $V = \begin{pmatrix} V_{k-1} & V_{(k-1)\perp} \end{pmatrix}$. Here $\Sigma_{k-1}$ is the upper-left $(k-1) \times (k-1)$ corner of $\Sigma$, and $U_{k-1}$ and $V_{k-1}$ are the leftmost $k-1$ columns of $U$ and $V$, respectively. Finally, we define

$$\tilde S_{\rm realized} = \begin{pmatrix} Q_0 & U_{k-1} \Sigma_{k-1} \end{pmatrix}, \qquad \tilde E_{\rm realized} = \begin{pmatrix} R_0 \\ V_{k-1}^T \end{pmatrix},$$

so that $\tilde S_{\rm realized} \tilde E_{\rm realized} = Q_0 R_0 + Q_1 R_1 = \tilde D_{\rm realized}$. The procedure described above ensures that $\tilde S_{\rm realized}$ and $\tilde E_{\rm realized}$ take the forms

$$\tilde S_{\rm realized} = \begin{pmatrix} 1 & s^{(1)}_1 & \cdots & s^{(1)}_{k-1} \\ \vdots & \vdots & & \vdots \\ 1 & s^{(m)}_1 & \cdots & s^{(m)}_{k-1} \end{pmatrix}, \qquad \tilde E_{\rm realized} = \begin{pmatrix} e^{(1)}_0 & \cdots & e^{(n)}_0 \\ \vdots & & \vdots \\ e^{(1)}_{k-1} & \cdots & e^{(n)}_{k-1} \end{pmatrix},$$

where $s^{(u)}_t$ is the $t$-th element of the GPT state vector representing the $u$-th preparation, and $e^{(v)}_t$ is the $t$-th element of the GPT effect vector representing the $v$-th measurement outcome.

If we chose to, we could simply take the $\tilde E_{\rm realized}$ returned by the decomposition of $\tilde D_{\rm realized}$ described above and define the larger matrix $\begin{pmatrix} \tilde E_{\rm realized} & 1 - \tilde E_{\rm realized} \end{pmatrix}$; the convex hull of the vectors in this larger matrix would then define our estimate, $E_{\rm realized}$, of the space of GPT effects realized in the experiment.
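The decomposition can be sketched numerically (toy dimensions; numpy's `qr` and `svd` routines stand in for the routines used in the actual analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 4, 8, 7
# build a rank-4 matrix whose leftmost column is the all-ones column of (D1)
S_true = np.hstack([np.ones((m, 1)), rng.standard_normal((m, k - 1))])
E_true = rng.standard_normal((k, n))
E_true[:, 0] = np.eye(k)[:, 0]            # first column = unit-effect column
D = S_true @ E_true                        # D[:, 0] is all ones

Q, R = np.linalg.qr(D, mode="complete")    # Q: m x m, R: m x n
c = Q[0, 0]                                # first column of Q is constant c
Qp, Rp = Q / c, c * R                      # rescale so Qp[:, 0] is all ones
Q0, Q1 = Qp[:, :1], Qp[:, 1:]
R0, R1 = Rp[:1, :], Rp[1:, :]

U, sv, Vt = np.linalg.svd(Q1 @ R1)         # Q1 @ R1 has rank k - 1
S_real = np.hstack([Q0, U[:, :k - 1] * sv[:k - 1]])
E_real = np.vstack([R0, Vt[:k - 1, :]])
# S_real @ E_real reproduces D, and the first column of S_real is all ones
```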
However, in an attempt to treat the outcome-'0' and outcome-'1' effect vectors on an equal footing, we instead define the larger matrix $\tilde D_R = \begin{pmatrix} \tilde D_{\rm realized} & 1 - \tilde D_{\rm realized} \end{pmatrix}$. We then find a decomposition $\tilde D_R = \tilde S_{\rm realized} \tilde E_R$ using the method described above. This ensures that $\tilde E_R$ contains, for each measurement, both the outcome-'0' effect vector and the corresponding outcome-'1' effect vector.

Appendix E: Computing the consistent spaces

The spaces $\tilde S_{\rm consistent}$ and $\tilde E_{\rm consistent}$ are the duals of the realized spaces $\tilde E_{\rm realized}$ and $\tilde S_{\rm realized}$, respectively. Here we will discuss how we calculate the consistent spaces from the realized ones.
We start with the calculation of $\tilde S_{\rm consistent}$. By definition, $\tilde S_{\rm consistent}$ is the intersection of the geometric dual of $\tilde E_{\rm realized}$ and the set of all normalized GPT states; specifically, it is the set of $s \in \mathbb{R}^k$ such that $0 \le s \cdot e \le 1$ for all $e \in \tilde E_{\rm realized}$ and such that $s \cdot u = 1$. This definition (called an inequality representation) completely specifies $\tilde S_{\rm consistent}$. However, in order to perform transformations on the space or calculate its volume, it can be useful to have its vertex description as well, which is a list of vertices that completely specify the space's convex hull. Finding a convex polytope's vertex representation given its inequality representation is called the vertex enumeration problem [89].
To find the vertex representation of $\tilde S_{\rm consistent}$, we first simplify its inequality representation. Since $\tilde E_{\rm realized}$ is a convex polytope, we need not consider every $e \in \tilde E_{\rm realized}$, but only the vertices of $\tilde E_{\rm realized}$. Denoting the set of vertices of $\tilde E_{\rm realized}$ by ${\rm Vertices}(\tilde E_{\rm realized})$, we can replace the condition "$\forall e \in \tilde E_{\rm realized}$" in the definition of $\tilde S_{\rm consistent}$ with "$\forall e \in {\rm Vertices}(\tilde E_{\rm realized})$". The calculation of ${\rm Vertices}(\tilde E_{\rm realized})$ is performed with the pyparma package [90] in Python 2.7.6. The calculation of the vertex description of $\tilde S_{\rm consistent}$ is performed with an algorithm provided by Avis and Fukuda [89]; we use functions in pyparma [90] which call the cdd library [91] to find the vertex description of $\tilde S_{\rm consistent}$.
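The duality computation can be illustrated with a brute-force vertex enumerator (a toy substitute for the Avis-Fukuda/cdd machinery used in the actual analysis; only sensible for tiny, low-dimensional systems):

```python
import itertools
import numpy as np

def enumerate_vertices(A, b, tol=1e-9):
    """Vertices of the polytope {x : A x <= b}, found by intersecting every
    choice of dim(x) facets and keeping the feasible intersection points."""
    d = A.shape[1]
    vertices = []
    for rows in itertools.combinations(range(len(A)), d):
        sub_A, sub_b = A[list(rows)], b[list(rows)]
        if abs(np.linalg.det(sub_A)) < tol:
            continue  # the chosen facets do not meet in a single point
        x = np.linalg.solve(sub_A, sub_b)
        if np.all(A @ x <= b + tol) and not any(np.allclose(x, v) for v in vertices):
            vertices.append(x)
    return vertices

# toy dual computation: the dual {s : s . e <= 1 for every vertex e} of the
# square with vertices (+-1, +-1) is the diamond with vertices (+-1, 0), (0, +-1)
E_vertices = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
dual_vertices = enumerate_vertices(E_vertices, np.ones(4))
```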
Finding the vertex description of $\tilde E_{\rm consistent}$ from $\tilde S_{\rm realized}$ is done in an analogous way. $\tilde E_{\rm consistent}$ is defined as the geometric dual of the sub-normalization of $\tilde S_{\rm realized}$, namely $\{w s : s \in \tilde S_{\rm realized},\ w \in [0, 1]\}$. The sub-normalization of $\tilde S_{\rm realized}$ is also the convex hull of the union of the GPT state vectors that make up the rows of $\tilde S_{\rm realized}$ and the GPT state vector with $s_0 = \cdots = s_{k-1} = 0$ that represents the state with normalization zero.
Appendix F: Maximal CHSH inequality violations with qubit-like state spaces

We here provide a proof of the fact that the optimal value of the CHSH inequality when Bob's system is described by a qubit-like state and effect space is the same as the value of the POM noncontextuality inequality for the same case, provided that the latter is at least $\frac{3}{4}$; that is,

$$B\big(S^{w}_{\rm qubit}, E^{w'}_{\rm qubit}\big) = \max\Big\{\tfrac{3}{4},\; C\big(S^{w}_{\rm qubit}, E^{w'}_{\rm qubit}\big)\Big\}. \qquad ({\rm F}1)$$

We begin with a geometric characterization of $S^{w}_{\rm qubit}$ and $E^{w'}_{\rm qubit}$. Recalling the Bloch representation of $S_{\rm qubit}$ and $E_{\rm qubit}$ from Sec. II B, and noting that the maximally mixed state is represented by $(1, 0, 0, 0)$, applying $D_w$ from Eq. (19) gives $S^{w}_{\rm qubit}$ as a ball of radius $w$, i.e., the set of $(1, s_1, s_2, s_3)$ with $\|(s_1, s_2, s_3)\| \le w$. Similarly, $E^{w'}_{\rm qubit}$ is a "Bloch diamond" of radius $w'$, i.e., the set of $(e_0, e_1, e_2, e_3)$ with $0 \le e_0 \le 1$ and $\|(e_1, e_2, e_3)\| \le w' \min\{e_0, 1 - e_0\}$. In particular, $E^{w'}_{\rm qubit}$ is the convex hull of $(0, 0, 0, 0)$, $(1, 0, 0, 0)$, and effects of the form $(\tfrac{1}{2}, e_1, e_2, e_3)$ with $\|(e_1, e_2, e_3)\| = \tfrac{1}{2} w'$. Thus this GPT shares with a qubit the feature that all binary-outcome measurements are convex combinations of (the analogue of) projective measurements. Specifically, the extremal binary-outcome measurements consist of the trivial binary-outcome measurement with effects $(0, 0, 0, 0)$ and $(1, 0, 0, 0)$, and the nontrivial binary-outcome measurements with effects $(\tfrac{1}{2}, e_1, e_2, e_3)$ and $(\tfrac{1}{2}, -e_1, -e_2, -e_3)$ with $\|(e_1, e_2, e_3)\| = \tfrac{1}{2} w'$.

Recall from Eq. (33) that we are interested in maximizing

$$\frac{1}{4} \sum_{a,b,x,y} \delta_{a \oplus b,\, xy}\; p_{a|x}\; s_{P^B_{a|x}} \cdot e_{b|M^B_y} \qquad ({\rm F}2)$$

over $\{p_{a|x}\}$ and $\{s_{P^B_{a|x}}\}$ satisfying the no-signalling constraint, Eq. (32), and over $\{e_{b|M^B_y}\}$. For each $b$, Eq. (F2) is convex-linear in Bob's effects $e_{b|M^B_y}$. Hence it suffices to maximize Eq. (F2) over the convexly extremal binary-outcome measurements.
In particular, Bob's optimal strategy will be one of two possibilities: at least one of his measurements is trivial, or both of his measurements are nontrivial.
First, consider the case where the optimum is achieved when one of Bob's measurements is trivial, i.e., has effects $(0, 0, 0, 0)$ and $(1, 0, 0, 0)$. Clearly this measurement can be implemented jointly with any other measurement, regardless of whether that other measurement is trivial or not. But violating a bipartite Bell inequality such as CHSH requires that both parties use incompatible measurements [92]. Hence the maximum value of Eq. (F2) in this case cannot exceed $B_{\rm loc} = \frac{3}{4}$. Indeed, this value can be achieved with both of Bob's measurements being trivial, for example by having Alice and Bob always output $a = b = 0$. Therefore, in this case

$$B\big(S^{w}_{\rm qubit}, E^{w'}_{\rm qubit}\big) = \frac{3}{4}. \qquad ({\rm F}3)$$

Now consider the case where the optimum is achieved when both of Bob's measurements are nontrivial, i.e., for each $(b, y)$, $e_{b|M^B_y} = (\tfrac{1}{2}, (-1)^b \tfrac{w'}{2}\, \hat{n}_y)$ for some unit vector $\hat{n}_y$. Writing the Bloch part of each steered state as $w\, \tilde{s}_{a|x}$ with $\|\tilde{s}_{a|x}\| \le 1$, the objective of Eq. (F2) becomes

$$\frac{1}{2} + \frac{w w'}{8} \sum_{x,y} (-1)^{xy}\; \hat{n}_y \cdot \Big( \sum_a (-1)^a\, p_{a|x}\, \tilde{s}_{a|x} \Big). \qquad ({\rm F}4)$$

Furthermore, the no-signalling constraint, Eq. (32), can be written as

$$\sum_a p_{a|0}\, \tilde{s}_{a|0} = \sum_a p_{a|1}\, \tilde{s}_{a|1}.$$

In the case $w w' = 1$, we recover the usual problem of maximizing the CHSH value where Bob performs projective measurements on a qubit, for which the maximum value $B_Q$ is given in Eq. (28). (The fact that we can optimize over the ensembles of states to which Alice steers, rather than over the bipartite state and Alice's measurements, follows from the Schrödinger-HJW theorem [93, 94].) Since the only place that $w$ and $w'$ appear in the problem is before the sum in Eq. (F4), and since $w w' > 0$, an optimal strategy for our problem will use the same $p_{a|x}$, $\tilde{s}_{a|x}$, and $\hat{n}_y$ as in the $w w' = 1$ case. Hence, if the optimal strategy uses a pair of nontrivial measurements, then

$$B\big(S^{w}_{\rm qubit}, E^{w'}_{\rm qubit}\big) = \frac{1}{2} + \frac{w w'}{2\sqrt{2}} = C\big(S^{w}_{\rm qubit}, E^{w'}_{\rm qubit}\big), \qquad ({\rm F}8)$$

where we have used Eq. (21). It follows that the optimal value is the maximum of Eq. (F3) and Eq. (F8), which establishes Eq. (F1).
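The scaling behaviour used in the proof can be checked numerically for the standard optimal qubit strategy (a sketch, not part of the proof; the shrinking factor `t` stands in for the product $w w'$):

```python
import numpy as np

# Shrinking Bob's projective effects toward the unit effect by a factor t
# scales the correlation part of B, giving B(t) = 1/2 + t/(2*sqrt(2)).
I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)  # |Phi+>
rho = np.outer(phi, phi)

def proj(obs, a):
    """Projector onto the (-1)^a eigenspace of a Pauli-type observable."""
    return (I2 + (-1) ** a * obs) / 2

def game_value(t):
    alice = [Z, X]
    bob = [(Z + X) / np.sqrt(2), (Z - X) / np.sqrt(2)]
    total = 0.0
    for x, y, a, b in np.ndindex(2, 2, 2, 2):
        if a ^ b == x * y:
            eb = t * (proj(bob[y], b) - I2 / 2) + I2 / 2  # shrunken effect
            total += np.trace(np.kron(proj(alice[x], a), eb) @ rho).real
    return total / 4
```

At `t = 1` this recovers the quantum maximum $B_Q = \frac{1}{2} + \frac{1}{2\sqrt{2}}$ of Eq. (28).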