Computational Advantage from the Quantum Superposition of Multiple Temporal Orders of Photonic Gates

Playing with causal orders leads to quadratic speed-up: An efficient algorithm is introduced and implemented in a scalable fiber-optic based experiment that achieves, for the first time, a quantum 4-switch gate.


I. INTRODUCTION
Quantum mechanics allows for processes where two or more events take place in a quantum superposition of different temporal orders. This exotic phenomenon results in causal nonseparability [1][2][3], and it is likely to be especially relevant in quantum treatments of gravity [4][5][6]. In fact, quantum control of temporal orders could be realized with quantum circuits exploiting hypothetical closed timelike curves [7,8], and it would also arise naturally due to the spacetime warping that macroscopic spatial superpositions of massive bodies would cause [9].
From a more practical perspective, advanced quantum computational models without definite gate orders have sparked a great deal of fundamental interest, as they do not fit into the usual paradigm of circuits with fixed gate connections [6,7,10-13]. The best-known example is the celebrated quantum $N$-switch gate $S_N$, which coherently applies a different permutation of $N$ given gates on a target quantum system conditioned on the state of a control quantum system [7,13,14]. The quantum $N$-switch has been identified as a resource for a number of exciting information-theoretic tasks. For instance, for $N = 2$, it allows one to deterministically distinguish pairs of commuting versus anticommuting unitaries [12]; remarkably, this translates into an exponential advantage in a communication complexity problem [15,16].

2691-3399/21/2(1)/010320(16) 010320-1 Published by the American Physical Society
In general, circuits that synthesize $S_N$ with a fixed gate order are known, but at the expense of quadratically more queries to (i.e., uses of) the gates [12-14,17]. As a consequence, $S_N$ allows one to solve a promise problem [12,14] on the permutations of $N$ unknown unitary gates with quadratically fewer queries in $N$ than all known circuits with fixed gate order. More precisely, the permutation sequences of the gates are promised to differ only by a phase factor, and $S_N$ efficiently estimates these phase differences. However, the algorithm for this problem [12,14] requires the target-system dimension to grow (super)exponentially with $N$, making it experimentally demanding. In fact, all experimental realizations of the quantum $N$-switch reported so far are restricted to the simplest case of $N = 2$ gate orders [16,18-22].
In this work, we introduce a novel algorithm that exploits the quantum $N$-switch and experimentally demonstrate it for $N = 4$ unitary gates. Specifically, we find a variant of the above phase-estimation problem, which we name the Hadamard promise problem, for which the quantum $N$-switch is also a resource but with considerably milder constraints on the target-system dimension. On the one hand, this problem plays a role in computation with indefinite gate orders analogous to Deutsch-Jozsa's [23] or Simon's [24] problems in the beginnings of quantum computation: a proof of principle of improvements over a previous paradigm. On the other hand, there are reasons to expect that practical applications of the Hadamard promise problem will be developed, both because closely related phase-estimation problems already have many applications, and because it involves the quantum Fourier transform, which is an important subroutine for a variety of quantum algorithms with practical applications [25]. The problem's promise is that the products of the $N$ unknown gates applied in $P$ different orders differ only in $+$ or $-$ signs that are encoded into one of the columns of a given $(P \times P)$-dimensional Hadamard matrix; the problem consists of finding which column it is.
The algorithm to solve this problem exploits the quantum $N$-switch, consuming $N$ queries to the gates, to deterministically find the column. This represents a speedup quadratic in $N$ in query complexity (i.e., number of queries) with respect to all known algorithms exploiting circuits with fixed gate orders (see Refs. [14,26,27] for a discussion of how to count queries in a quantum switch). Hence, the algorithm is not only an interesting computational primitive on its own but also a practical tool to benchmark experimental realizations of $S_N$, because the quantum $N$-switch is the only known process for which the algorithm succeeds with unit probability for all gates satisfying the promise while only consuming $N$ gate queries. To demonstrate the practicability of the algorithm, we implement it with a quantum $N$-switch of $N = 4$ gates using modern multicore optical-fiber technology [28][29][30][31]. The four gates are implemented on the target polarization qubits using programmable liquid-crystal devices, and the spatial degree of freedom of a single photon is used as the control system. We obtain an average success probability for the algorithm, over different sets of gates, of $p_{\mathrm{succ}} \approx 0.95$. Our results represent the first demonstration of the quantum $N$-switch gate for $N$ larger than 2, as well as of its efficiency for phase-estimation problems involving multiple unknown gates.

A. Quantum control of gate orders
In quantum computation, a quantum switch can be described by a special type of controlled operation that applies a particular unitary gate $\Pi_x$ to a target system ($t$) for each different state of a control system ($c$). We define the quantum $N$-switch gate as

$$S_N \left( |x\rangle_c \otimes |\psi\rangle_t \right) := |x\rangle_c \otimes \Pi_x |\psi\rangle_t , \qquad (1)$$

where $|x\rangle_c$ is the $x$th member of the computational basis of the control system and $|\psi\rangle_t$ is an arbitrary state of the target system. The heart of the quantum $N$-switch is the operator $\Pi_x := U_{\sigma_x(N)} \cdots U_{\sigma_x(2)} U_{\sigma_x(1)}$, which is a product of the $N$ unitary gates in a fixed set $\mathcal{U} := \{U_A, U_B, \ldots\}$ in their $x$th ordering. More precisely, $\sigma_x$ is a vector with $N$ elements specifying the $x$th permutation of the $N$ gates in $\mathcal{U}$, i.e., it specifies the ordering sequence of the unitaries, so that $\sigma_x(j)$ is the $j$th element in the $x$th permutation (the gate applied first stands rightmost in the product). Controlling the implementation of $P$ different permutations of gates requires a control system of dimension at least $P$. The dimension of the target system can be arbitrary and we denote it as $d$. With $S_N$ defined as in Eq. (1), it is clear that $c$ coherently controls the order of the $N$ unitary gates applied to system $t$, which explains the name "quantum control of gate orders" (QCGO). We note that the usual definition [13,14] of the quantum $N$-switch deals only with the specific case of all $N!$ permutations of the gates in $\mathcal{U}$. However, here (as in Refs. [32,33]) we are interested in the more general case $P \leq N!$. Clearly, the general definition of QCGO is independent of the specific choice of gates in $\mathcal{U}$. A convenient mathematical tool to capture that is the quantum $N$-switch process $W_N$, which produces the quantum $N$-switch gate $S_N$ when given the set of gates $\mathcal{U}$ as input. For the technical definition of processes, we refer the reader to Refs. [1][2][3][34].
Intuitively, one can think of a process as the quantum evolution generated by an experimental arrangement with open slots for gates on the target system to be inserted [10,11], as represented in Fig. 1(a). Inside the process, the connections between the inserted gates may be subject to the quantum superposition principle. For instance, in Fig. 1(b) we pictorially represent our experimental implementation of the quantum 4-switch gate $S_4$, with a coherent quantum superposition of $P = 4$ different gate connections (each one in a different color) for the particular choice of permutation set {ABCD, BADC, CBDA, DACB}. Such superpositions give rise to QCGO, which corresponds to a specific type of quantum control of causal orders [35] (and both phenomena are in turn contained within the general notion of causal nonseparability [1][2][3]). In particular, QCGO takes place when those gate connections are coherently controlled by a control system, as in Eq. (1). Aside from being a fundamentally interesting phenomenon, QCGO turns out to be a physical resource for interesting phase-estimation problems, as we discuss next.

FIG. 1. (a) The process $W_4$ (light-gray region) can be thought of as an experimental setup (e.g., a quantum circuit or interferometer) through which the composite control-target system goes and with open slots for target-subsystem gates $U_i$ (dark-gray boxes) for $i$ = A, B, C, or D to be inserted. Inside $W_4$, the connections between these gates are coherently controlled by the control subsystem, an effect known as quantum control of gate orders. This property is a physical resource for certain quantum computations (phase-estimation problems), and $W_4$ is the resourceful object that bears it. The concatenation of $W_4$ with the inserted gates yields the quantum 4-switch gate $S_4$, a joint unitary operation on the composite system. (b) Concrete schematics of the specific variant of the quantum 4-switch process experimentally implemented in this work. The target subsystem undergoes the four-gate sequence in a quantum superposition (center) of $P = 4$ different orderings (permutations of the string ABCD): ABCD, BADC, CBDA, DACB. Each permutation is shown individually in a different color and panel. (c) In the abovementioned computations, the target-subsystem gates are unknown. For the purpose of complexity analysis, they can be thought of as produced upon request by a quantum oracle $O$. This takes as input $i$ = A, B, C, or D and outputs a black-box device implementing the unknown gate $U_i$. Each such call to the oracle counts as an oracle query. The $N$-switch process allows one to solve computational problems on the phase relationships between permutations of the black-box gates with considerably fewer oracle queries (i.e., lower query complexity) than any process with fixed (or classically controlled) gate connections.
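As an illustration of the definition of the quantum $N$-switch gate, the following minimal sketch builds the matrix of $S_4$ for the permutation set {ABCD, BADC, CBDA, DACB} as a block-diagonal operator, with one block per control basis state. The gate choice (Pauli matrices on a qubit), the function names, and the test state are assumptions made for illustration only; they are not taken from the experiment.

```python
import numpy as np

# Pauli matrices and identity: illustrative single-qubit gates (d = 2).
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def ordered_product(gates, order):
    """Return the product for an ordering such as "CBDA".

    The first letter acts first, so it ends up rightmost in the matrix product.
    """
    d = next(iter(gates.values())).shape[0]
    prod = np.eye(d, dtype=complex)
    for label in order:
        prod = gates[label] @ prod
    return prod

def n_switch(gates, orders):
    """Return the switch gate as a (P*d) x (P*d) block-diagonal matrix.

    Block x (control state |x>) holds the product of the gates in ordering x.
    """
    P = len(orders)
    d = next(iter(gates.values())).shape[0]
    S = np.zeros((P * d, P * d), dtype=complex)
    for x, order in enumerate(orders):
        S[x * d:(x + 1) * d, x * d:(x + 1) * d] = ordered_product(gates, order)
    return S

ORDERS = ["ABCD", "BADC", "CBDA", "DACB"]       # permutation set from the text
gates = {"A": X, "B": Z, "C": I2, "D": I2}       # hypothetical gate choice
S4 = n_switch(gates, ORDERS)

# Check unitarity, and that |2>_c |psi>_t is mapped to |2>_c (U_A U_D U_B U_C)|psi>_t.
psi = np.array([0.6, 0.8j])
ctrl = np.zeros(4)
ctrl[2] = 1
out = S4 @ np.kron(ctrl, psi)
expected = np.kron(ctrl, gates["A"] @ gates["D"] @ gates["B"] @ gates["C"] @ psi)
is_unitary = np.allclose(S4.conj().T @ S4, np.eye(8))
acts_as_expected = np.allclose(out, expected)
print(is_unitary, acts_as_expected)
```

The control index is taken as the major index of the Kronecker product, so each $2 \times 2$ diagonal block of the matrix corresponds to one gate ordering.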

B. The Araújo-Costa-Brukner algorithm
The quantum $N$-switch process provides an advantage for solving a particular phase-estimation problem [12,14], to which we refer here as the Fourier promise problem. In this type of problem, one has access to a quantum oracle $O$ for $\mathcal{U}$, i.e., a black-box device that delivers a gate $U_i \in \mathcal{U}$ every time it is queried [see Fig. 1(c)]. No information about the gates is available except for the promise that, for the constant phase factor $\omega := e^{i 2\pi/P}$ and all $x \in [P]$, they satisfy the property that

$$\Pi_x = \omega^{x y}\, \Pi_0 \qquad (2)$$

for some fixed, unknown $y \in [P]$, where the shorthand notation $[P] := \{0, 1, \ldots, P - 1\}$ has been introduced. The task is to determine which one of the properties holds, i.e., to find $y$.
The Araújo-Costa-Brukner algorithm to solve this problem is based on the standard Hadamard test [36], and shares similarities with the Kitaev phase-estimation algorithm [37]. The control system is initialized in the computational-basis reference state $|0\rangle_c$, while the target system starts in an arbitrary state $|\psi\rangle_t$. A $P$-dimensional quantum Fourier transform $F_P$ on $c$ maps it to a uniform superposition of all computational-basis states. Then, the quantum $N$-switch gate is applied. Because of property (2), this introduces the phase factor $\omega^{xy}$ to each computational-basis state $|x\rangle_c$ in the superposition, while the state $\Pi_0 |\psi\rangle_t$ of the target system factorizes. The value of $y$ is thus encoded into the phases of the superposition state of the control system. To map it back to the computational basis, one uncomputes the Fourier transform (applying its inverse $F_P^{-1} = F_P^{\dagger}$). In symbols [14],

$$|0\rangle_c \otimes |\psi\rangle_t \xrightarrow{F_P \otimes \mathbb{1}} \frac{1}{\sqrt{P}} \sum_{x \in [P]} |x\rangle_c \otimes |\psi\rangle_t \xrightarrow{S_N} \frac{1}{\sqrt{P}} \sum_{x \in [P]} \omega^{xy} |x\rangle_c \otimes \Pi_0 |\psi\rangle_t \xrightarrow{F_P^{-1} \otimes \mathbb{1}} |y\rangle_c \otimes \Pi_0 |\psi\rangle_t . \qquad (3)$$

Then, $y$ is finally read out by a single-shot computational-basis measurement on $c$.
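The steps of the algorithm can be checked numerically in the simplest case $N = P = 2$, $d = 2$. The sketch below is a toy state-vector simulation under the assumption that the two gates are the anticommuting Pauli matrices $X$ and $Z$, so that the two orderings differ by the factor $\omega = -1$ and the promise holds with $y = 1$. The function names are hypothetical.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def fourier(P):
    """P-dimensional quantum Fourier transform, entries omega^(x*y)/sqrt(P)."""
    idx = np.arange(P)
    return np.exp(2j * np.pi * np.outer(idx, idx) / P) / np.sqrt(P)

def run_fourier_algorithm(products, psi):
    """Apply F on c, the switch gate, then F^-1 on c, starting from |0>_c |psi>_t.

    Returns the outcome probabilities of a computational-basis measurement on c.
    """
    P, d = len(products), len(psi)
    F = np.kron(fourier(P), np.eye(d))
    S = np.zeros((P * d, P * d), dtype=complex)
    for x, Pi in enumerate(products):
        S[x * d:(x + 1) * d, x * d:(x + 1) * d] = Pi
    state = np.kron(np.eye(P)[0], psi)
    state = F.conj().T @ S @ F @ state
    return np.sum(np.abs(state.reshape(P, d)) ** 2, axis=1)

# Orders AB and BA: the second product equals minus the first, i.e., y = 1.
U_A, U_B = X, Z
products = [U_B @ U_A, U_A @ U_B]
probs = run_fourier_algorithm(products, np.array([1, 0], dtype=complex))
y = int(np.argmax(probs))
print(y, np.round(probs, 6))
```

Replacing `U_B` by a gate that commutes with `U_A` (for example `X` itself) moves all the probability to outcome 0, which is the commuting-versus-anticommuting discrimination task mentioned in the Introduction.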

To apply $S_N$, one must consume $N$ queries to $O$. Therefore, the query complexity (i.e., total number of oracle queries) of the algorithm is $Q = N$ for all $P \leq N!$. Remarkably, causally ordered processes (i.e., those produced by circuits with fixed, or classically controlled, gate connections) require considerably more queries to solve the same problem. For instance, for $P = N!$, the best causally ordered process displays a query complexity $Q$ that scales as $N^2$ [13,14,17], i.e., quadratically higher in $N$. A downside of the algorithm, however, is that the target-system dimension $d$ must grow with the number $P$ of gate orders. This can be seen [14] by taking the determinant of both sides of Eq. (2). For $y = 1$, and since $\det \Pi_x = \det \Pi_0$ (both are products of the same $N$ gates), this imposes $\det \Pi_0 = \omega^{xd} \det \Pi_0$ (and, hence, $1 = e^{i 2\pi x d/P}$) for all $x \in [P]$, which is possible only if $d \geq P$. This constraint is especially significant for experimental realizations, where coherently manipulating high-dimensional target systems together with high-dimensional control systems is challenging [16]. For example, this limitation implies that, if the polarization of a single photon ($d = 2$) is used as the target system, the algorithm is useful only for $P = 2$, despite the fact that the spatial degree of freedom of the photon is amenable to encoding much higher-dimensional control systems [38]. To overcome this, we next introduce another variant of the phase-estimation problem that is considerably less sensitive to the determinant constraint.

III. A NEW COMPUTATIONAL PRIMITIVE: THE HADAMARD PROMISE PROBLEM
We consider a different promise on the gates that the oracle $O$ outputs. Given a known $(P \times P)$-dimensional square matrix $M_P$ of entries $m_{x,y} = \pm 1$, we require that the black-box unitaries in $\mathcal{U}$ satisfy, for all $x \in [P]$, the property that

$$\Pi_x = m_{x,y}\, \Pi_0 \qquad (4)$$

for some fixed, a priori unknown matrix column $y \in [P]$. The task is, again, to find $y$. In contrast to the complex-phase relation of Eq. (2), the constraint that this real-phase relation imposes on $d$ is much softer. As one can see by taking the determinant of both sides of Eq. (4), the only requirement that arises now is that $(m_{x,y})^d = 1$ for all $x, y \in [P]$, which is satisfied by any even $d$. With this, the promise problem finds application even when the target system is a simple qubit, regardless of the number of permutations $P$. Instead of a single complex phase factor, the value of $y$ is now encoded in a string of $P$ real phase factors (i.e., a column of $M_P$). The question, then, is how to decode that information. Luckily, the value of $y$ can be mapped back onto the computational basis of $c$ with a simple procedure, similar to that in Eq. (3), provided that $M_P$ is a Hadamard matrix [36].
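The two determinant constraints can be compared with a short numerical check. The sketch below tests, for a given target dimension $d$, whether the complex-phase promise (in the case $y = 1$ discussed for the Fourier promise problem) and the real-phase promise are compatible with the determinant argument. The function names are hypothetical and the check covers only the determinant condition, not the existence of suitable gates.

```python
import numpy as np

def fourier_promise_allowed(d, P):
    """Check 1 = omega^(x*d) for all x in [P], with omega = exp(2*pi*i/P) (case y = 1)."""
    x = np.arange(P)
    return bool(np.allclose(np.exp(2j * np.pi * x * d / P), 1))

def hadamard_promise_allowed(d):
    """Check m^d = 1 for both possible matrix entries m = +1 and m = -1."""
    return all(m ** d == 1 for m in (+1, -1))

# A qubit target (d = 2) passes the complex-phase check only for P = 2,
# but passes the real-phase check independently of P.
for P in (2, 4, 8):
    print(P, fourier_promise_allowed(2, P), hadamard_promise_allowed(2))
```

For $d = 2$ the first check returns `True` only for $P = 2$, whereas the second one returns `True` for every $P$, in line with the statement that any even $d$ suffices.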
A Hadamard matrix (of order $P$) is a $(P \times P)$-dimensional square matrix $M_P$ with entries $m_{x,y} = \pm 1$ and whose columns (or, equivalently, whose rows) are all mutually orthogonal. The transpose $M_P^T$ of $M_P$ is proportional to its inverse: $(1/P)\, M_P \cdot M_P^T = \mathbb{1}$, with $\mathbb{1}$ the identity matrix. Such matrices can only exist for $P$ equal to 1, 2, or integer multiples of 4, and are conjectured to exist for all such dimensions. In fact, they can be generated recursively for any $P = 2^k$ with $k \in \mathbb{N}$. Here we are actually interested in the subset of Hadamard matrices with all $+1$s in the first row ($x = 0$) and column ($y = 0$). The former condition is required by Eq. (4), whereas the latter condition is necessary in our algorithm below for correct encoding (see Appendix A 1 for details). With this, we can formally rephrase this promise problem as follows.

Problem 1 (Hadamard promise problem). Given a $(P \times P)$-dimensional Hadamard matrix $M_P$ with all $+1$s in its first row and column, and an oracle $O$ for a set $\mathcal{U}$ of $N$ unknown unitary gates that satisfy Eq. (4) for all $x \in [P]$ and some fixed, unknown $y \in [P]$, find $y$.

The algorithm to solve it with the quantum $N$-switch gate is similar to the Araújo-Costa-Brukner algorithm but with the quantum Hadamard gate $H_P$ associated to $M_P$ playing the role of $F_P$. The matrix representation of $H_P$ in the computational basis is $H_P := M_P/\sqrt{P}$. Then, the following algorithm solves Problem 1.

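Algorithm 1. Initialize the joint system in the state $|0\rangle_c \otimes |\psi\rangle_t$, with $|\psi\rangle_t$ arbitrary. Apply $H_P$ to $c$, then the quantum $N$-switch gate $S_N$, and then $H_P^{-1} = H_P^T$ to $c$:

$$|0\rangle_c \otimes |\psi\rangle_t \xrightarrow{H_P \otimes \mathbb{1}} \frac{1}{\sqrt{P}} \sum_{x \in [P]} |x\rangle_c \otimes |\psi\rangle_t \xrightarrow{S_N} \frac{1}{\sqrt{P}} \sum_{x \in [P]} m_{x,y} |x\rangle_c \otimes \Pi_0 |\psi\rangle_t \xrightarrow{H_P^{-1} \otimes \mathbb{1}} |y\rangle_c \otimes \Pi_0 |\psi\rangle_t . \qquad (5)$$

Finally, read out $y$ as the outcome of a single-shot computational-basis measurement on $c$.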
This algorithm thus provides the desired phase relation between the $P$ different permutations of the $N$ unknown unitaries under consideration. The validity of Eq. (5) is proven explicitly in Appendix A 1. The query complexity of the algorithm is the same as that of the Araújo-Costa-Brukner algorithm: $Q = N$ for all $P \leq N!$. The crucial resource for Algorithm 1 is the quantum $N$-switch process. Similarly to the Fourier promise problem [14], no causally ordered process is known to solve Problem 1 in general (i.e., for any arbitrary set $\mathcal{U}$ of unknown gates fulfilling the promise) with a query complexity linear in $N$. In fact, the (querywise) optimal causally ordered processes known to solve the problem in general are simply the fixed-gate circuits that simulate the quantum $N$-switch exactly (see Sec. VII), but these require considerably more queries [13,14,17]. For instance, in the case where all gate permutations are considered ($P = N!$), simulating the quantum $N$-switch exactly in the black-box scenario requires a number of oracle queries $Q$ that scales as $N^2$, i.e., quadratically higher in $N$. Another concrete example is the quantum 4-switch process for the $P = 4$ permutations in the set {ABCD, BADC, CBDA, DACB} [shown in Fig. 1(b)], whose experimental implementation we describe below. The optimal circuit to simulate it exactly in the black-box scenario requires $Q = 9$ oracle queries, i.e., more than twice as many as with $S_4$ (see Appendix A 2).
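Algorithm 1 can be simulated on a state vector for $N = P = 4$ and a qubit target. The sketch below makes several assumptions that are not taken from the paper: the Hadamard matrix is the recursively generated (Sylvester-type) $M_4$, whose column ordering may differ from the experimentally implemented matrix, and the four gate sets are hypothetical Pauli-gate choices constructed so that each one satisfies the promise for one column $y$. They are not the sets of Table I.

```python
import numpy as np

I2, X, Z = np.eye(2), np.array([[0., 1.], [1., 0.]]), np.diag([1., -1.])
ORDERS = ["ABCD", "BADC", "CBDA", "DACB"]   # permutation set from the text

def sylvester_hadamard(k):
    """Recursive construction of a 2^k x 2^k Hadamard matrix: M -> [[M, M], [M, -M]]."""
    M = np.array([[1]])
    for _ in range(k):
        M = np.block([[M, M], [M, -M]])
    return M

M4 = sylvester_hadamard(2)      # assumed Hadamard matrix (first row and column all +1)
H4 = M4 / 2.0                   # Hadamard gate H_P = M_P / sqrt(P)

# Hypothetical gate sets (U_A, U_B, U_C, U_D), one per column y.
GATE_SETS = {0: (Z, Z, Z, Z), 1: (I2, I2, X, Z), 2: (X, I2, I2, Z), 3: (X, Z, I2, I2)}

def products(gate_tuple):
    """Return the four ordered products; the first letter of each ordering acts first."""
    gates = dict(zip("ABCD", gate_tuple))
    out = []
    for order in ORDERS:
        Pi = np.eye(2)
        for label in order:
            Pi = gates[label] @ Pi
        out.append(Pi)
    return out

def run_algorithm(gate_tuple, psi):
    """Hadamard on c, switch gate, inverse Hadamard on c; return control probabilities."""
    Pis = products(gate_tuple)
    amp = H4[:, 0]                                               # control amplitudes after H
    state = np.array([a * (Pi @ psi) for a, Pi in zip(amp, Pis)])  # after switch, shape (P, d)
    state = H4.T @ state                                         # inverse Hadamard on control
    return np.sum(np.abs(state) ** 2, axis=1)

psi = np.array([0.6, 0.8])
results = {}
for y, gate_tuple in GATE_SETS.items():
    Pis = products(gate_tuple)
    # The promise: ordering x differs from ordering 0 by the sign M4[x, y].
    assert all(np.allclose(Pis[x], M4[x, y] * Pis[0]) for x in range(4))
    results[y] = int(np.argmax(run_algorithm(gate_tuple, psi)))
print(results)
```

In this noiseless toy model each gate set is identified with unit probability using one application of each of the four gates, which is the sense in which the query complexity is $Q = N$.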

IV. EXPERIMENTAL QUANTUM CONTROL OF THE ORDER OF MULTIPLE GATE OPERATIONS
The experiment is illustrated in Fig. 2(a). It is based on multicore optical fibers and new related technology [28], which was recently introduced as a toolbox for quantum information processing [29][30][31]. In our implementation of the quantum 4-switch, the control system corresponds to the spatial mode of a single photon, while the target is its polarization. Following Algorithm 1, a conventional illumination scheme (see Sec. VII) is used to generate single photons propagating over a single-mode fiber in the initial spatial mode state $|0\rangle_c$. The photons are then sent through a four-core-fiber beam splitter (4CF-BS), which has been shown to realize with high fidelity the Hadamard operation $H_4 = M_4/2$ given in Eq. (6) [39]. Note that this matrix is self-inverse. The 4CF-BS is placed between commercial spatial multiplexer/demultiplexer (DMUX) units [40,41], which couple four single-mode fibers (yellow fibers) to the four cores of the multicore fibers (green fibers). These units connect to the 4CF-BS through the multicore fibers [see the details in Fig. 2(b)].

FIG. 2. An input photon is divided coherently between four spatial modes using a four-core-fiber beam splitter (4CF-BS), placed between commercial multiplexer/demultiplexer (DMUX) units, as shown in (b). The four output modes are then sent to the quantum 4-switch gate $S_4$. Each spatial mode is related to a unique permutation of the four unitary polarization operations applied by $S_4$ and indicated by a different color. The photons enter through the IN side (right) and exit through the OUT side (left), where, for example, the notation "← A" means "from A" and "A ←" means "to A." One can follow a certain path by looking at the output labels. For instance, the green input mode enters in C and continues to "B, then D, then A, and finally exits," corresponding to the operation of the four polarization unitaries in the order CBDA. After $S_4$, the four spatial modes are recombined using a second 4CF-BS. Each output 0-3 is connected directly to a single-photon detector (APD). The detection of a single photon in the $y$th ($y = 0, 1, 2, 3$) output detector identifies in a single shot the phase relation $y$ of the four unitaries implemented in the quantum 4-switch gate. See the main text and Sec. VII for further details.

After transmission through the 4CF-BS, the photon is sent to the quantum 4-switch gate $S_4$, which coherently applies different permutations of four unitary operations $U_i$ on the target system (photon polarization), depending on the spatial mode. To see this, note that each output of the 4CF-BS routes the photon through a different ordering of the polarization operations $U_i$, which are realized with controllable liquid crystal retarders (LCRs). To control the implementation order of the $U_i$, we take advantage of the DMUX units. Each single-mode fiber input to the quantum 4-switch gate is connected to a different four-core fiber (4CF) on the IN side of $S_4$ using a DMUX unit. The other end of each 4CF is attached to a fiber launcher. The photon leaves the launcher in free space, passes through the LCR, and is coupled back into another 4CF on the OUT side. The OUT 4CF is connected (via another DMUX) to single-mode fibers, which are then connected to the next 4CF (exploiting the already installed DMUXs) back on the 4-switch's IN side, following the ordering shown in Fig. 2(a). For example, a photon in the green input undergoes the operation of the four unitaries in the order C → B → D → A, resulting in the product unitary $\Pi_2 = U_A U_D U_B U_C$. The other three inputs lead the photon through one of the other three permutations shown in Fig. 1(b). After $S_4$, a second Hadamard operation is applied to the control system using a second set of DMUX+4CF-BS+DMUX, in accordance with Algorithm 1. The setup is thus a four-arm interferometer with each output directly connected to an InGaAs APD, working in gated mode and configured with 10% overall detection efficiency and 5 ns gate width. The detection of a single photon in the $y$th ($y = 0, 1, 2, 3$) output detector univocally identifies in a single shot the property $y$, indicating the phase relations of the four unitaries implemented in the quantum 4-switch gate.

TABLE I. Tables of polarization unitaries used for the implementations of two different quantum 4-switch gates (both with the same set of gate permutations {ABCD, BADC, CBDA, DACB}; here $\mathbb{1}$ is the identity, and $Z$ and $X$ are the Pauli operators). For both tables, each column provides a different set $\mathcal{U}$ of oracle gates. In turn, each such set exhibits the phase relations encoded, via Eq. (4), in the corresponding column $y$ of the matrix in Eq. (6). That is, the implemented oracle gates fulfil the problem's promise with respect to the experimentally implemented Hadamard matrix and the chosen set of permutations.

Before implementing the quantum 4-switch process, an initial alignment procedure using a polarimeter is performed. In-fiber polarization controllers (not shown in Fig. 2) are used in all single-mode fibers of the quantum 4-switch to ensure that every fiber corresponds to an identity operation on the polarization. They are also used at the final set of DMUX+4CF-BS+DMUX to guarantee the indistinguishability of the core modes, such that there is no path information available that would compromise the visibility of the interferometer [42,43]. The LCRs implementing the unitaries can be adjusted between identity and a half-wave plate by controlling the input voltage. In this way, we can toggle between an identity operation $\mathbb{1}$ and one of the operators $Z$, $(Z + X)/\sqrt{2}$, or $X$, when the orientation angle of the LCR is 0°, 22.5°, or 45°, respectively. Importantly, we note that the LCRs are placed at the far-field plane of the 4CF launchers and that this guarantees that the unitary operations $U_i$ are indistinguishable when applied in different orders (see Sec. VII). A computer-controlled field-programmable gate array (FPGA2) unit is used to control the LCRs.
In Table I we list the polarization operations $U_i$ for two different implementations of the quantum 4-switch process. Table I(a) corresponds to orthogonal operations (for each given column), while Table I(b) includes nonorthogonal operations, which makes it more difficult to mimic the quantum $N$-switch with a causally ordered process (see below and Appendix A 3). In each table, the $y$th column defines a different set $\mathcal{U}$ of the target-system unitary gates and corresponds to the $y$th column of the Hadamard matrix in Eq. (6) (see Sec. VII). In our experiment, by exploiting the controlled LCRs, we are able to toggle between the different sets $\mathcal{U}$ of unitaries in real time. In Fig. 3(a) we show an example of the results recorded while switching randomly (with uniform probabilities) between operations corresponding to different columns of Table I(a), about every minute. In each 0.1 s measurement we detected a total of about 6000 events. In Figs. 3(b) and 3(c) we show a summary of experimentally obtained success probabilities (each obtained from about $3 \times 10^4$ events) to identify the relative-phase relations between the different permutations of the unitary operations in Table I(a) and Table I(b), respectively. For Table I(a), we obtain an average success probability of $p_{\mathrm{succ}} = 0.948 \pm 0.005$, whereas for Table I(b), we obtain $p_{\mathrm{succ}} = 0.959 \pm 0.008$. Error bars correspond to one standard deviation, and are obtained by error propagation of the Poissonian count statistics. These results demonstrate the successful implementation of the quantum 4-switch process.

V. BENCHMARKING EXPERIMENTAL QUANTUM CONTROL OF MULTIPLE GATE ORDERS
To benchmark the realization of QCGO, it is useful to imagine a verification scenario, in which a verifier controls the oracle, while the process is implemented by a prover. The prover wishes to prove to the verifier that the process does display QCGO, and the verifier can test this by asking the prover to compute properties of oracles involving different gates. The quantum $N$-switch process allows the prover to solve the computations with considerably fewer oracle queries than any process with fixed (or classically controlled) gate connections. Indeed, it is the only process known to provide a unit success probability for Problem 1 in general (i.e., for any set of black-box gates satisfying the promise) with only $N$ queries to the oracle. This can be used to give the verifier evidence in favor of the prover's honesty. However, if the table of oracle gates has a small number of columns (e.g., as in Table I), a dishonest prover with side information about the table can attain $p_{\mathrm{succ}} = 1$ with a causally ordered process (see Appendix A 3), thus deceiving the verifier.
One way to benchmark experimental quantum switches with minimal assumptions is by measuring so-called causal witnesses [2,44]. Interestingly, by increasing the number of columns in the oracle-gate table (i.e., of possible choices for the gate sets $\mathcal{U}$) and suitably choosing their prior probability distribution, Algorithm 1 can be turned into a causal witness for the quantum switch. That is, for sufficiently large oracle-gate tables and an appropriate prior distribution for the gate sets $\mathcal{U}$, an upper bound $p_{\mathrm{succ}}^{\mathrm{CCGO}}$ strictly smaller than 1 can be found for the probability of success attainable by processes with classical control of gate orders (CCGO). This provides us with a gap from the probability of success obtained by the quantum switch, which always remains unity in the noiseless case. Details on our search for witnesses are given in Appendix A 4.
Unfortunately, the number of measurement settings required to measure such witnesses is prohibitively high in practice for this experimental setup. For instance, the best witness for $W_4$ we can obtain with the abovementioned approach gives $p_{\mathrm{succ}}^{\mathrm{CCGO}} \approx 0.89$, but requires an oracle-gate table with 300 columns. Alternatively, weaker witnesses with $p_{\mathrm{succ}}^{\mathrm{CCGO}} \approx 0.92$ can also be found, but these still require 60 columns. Our LCR-based setup cannot switch among so many gates in a practical way. Nevertheless, it is a remarkable feature of our experiment that we do reach values of $p_{\mathrm{succ}}$ significantly higher than both bounds, which would conclusively benchmark $W_4$ for a higher number of settings. In addition, we note that witnesses with similarly high numbers of settings (259) have indeed been measured in other platforms, though with much slower switching times [19].
Alternatively, smaller oracle-gate tables suffice if the verifier can actively reduce the prover's potential knowledge about the tables. One way to do this is by allowing the verifier to apply a random basis rotation to each gate before delivering it to the prover. For instance, in this scenario, an upper bound $p_{\mathrm{succ}}^{\mathrm{CCGO}} \approx 0.84$ can be obtained for an oracle-gate table with only 30 columns (see Appendix A 4). Unfortunately, implementing such a causal witness would require the ability to switch among a continuum of gates, which is again experimentally infeasible. Nevertheless, here we are mainly interested in benchmarking our implementation of $W_4$ against experimental imperfections, rather than against hypothetical malicious provers exploiting side information about the gates' bases. In this regard, the experimentally obtained values in Fig. 3 are in the range $p_{\mathrm{succ}} \approx 0.93$-$0.97$, which suggests that our setup should be capable of obtaining average success probabilities that are larger than the thresholds mentioned above, for a larger number of settings.
Though not yet conclusive, this provides encouraging evidence for the QCGO of the implemented process.

VI. DISCUSSION
Here we introduce the "Hadamard promise problem," a novel computational primitive involving the relative phases between different permutations of multiple unknown gates. We present an algorithm to solve it efficiently, illustrating a quantum computational advantage associated to the coherent quantum control of the order in which a sequence of $N$ unitary operations is applied. Our algorithm, which we implement experimentally for $N = 4$, exploits the quantum $N$-switch process to solve the problem with $N$ applications of the unitary gates, whereas the known methods exploiting fixed gate orders use the gates $O(N^2)$ times. Both the problem and algorithm have the advantage that the target system needs only be two dimensional, as opposed to $N!$ dimensional as in previous proposals. This could inspire new approaches for exploiting indefinite causal order in quantum computation and communication, as well as for studying causal nonseparability in physical systems.
We experimentally implement the algorithm by constructing a quantum 4-switch process that coherently controls four different gate orderings with high fidelity, showing success probabilities for the algorithm of approximately 0.95. The all-optical setup involves a four-path interferometer constructed with new multicore optical fiber technology. As discussed in Sec. VII, the best-known quantum circuit with fixed gate orders solves this problem with 9 gate queries. Our experiment thus corresponds to a five-query improvement. Moreover, this is, to the best of our knowledge, the first report of a quantum superposition of more than 2 temporal orders. In addition, our implementation presents some technical advantages as well. On the one hand, it is versatile in that the gate orders can be modified in a practical fashion by switching the optical fiber connections and that the unitary gates themselves can be automatically controlled through the liquid crystal polarization retarders. On the other hand, the setup can be scaled up to higher control-system dimensions in a straightforward fashion. This work constitutes a key step towards realizing and verifying causal nonseparability among a large number of parties, and should play an important role in developing methods to exploit this resource.

A. Query complexity analysis
One may argue that implementing S N is not the only way to solve Problem 1 (which is also true for the Fourier promise problem [14]). Here, we estimate the query complexity of other plausible approaches.
A natural approach is to tomographically reconstruct the N unitary gates and then multiply them to estimate the products $\Pi_x$, from which one can infer y. Since each $\Pi_x$ is an N-fold product of the $U_i$, the overall error $\varepsilon$ in its estimation is $\varepsilon = O(N\tilde{\varepsilon})$, where $\tilde{\varepsilon}$ is the statistical error of the reconstruction of each $U_i$. To attain a constant overall error, one thus needs $\tilde{\varepsilon} = O(1/N)$, which, by virtue of Hoeffding's bound, in turn requires $q = O(1/\tilde{\varepsilon}^2) = O(N^2)$ queries to each $U_i$. Moreover, since there are N gates to reconstruct, the overall query complexity is $Q = O(Nq) = O(N^3)$, i.e., cubically worse in N than with the quantum N-switch.
Another alternative is to tomographically reconstruct each $\Pi_x$ directly, and from that infer y. However, to query each N-fold product $\Pi_x$, one must query all N unitaries, and there are P such products. Hence, the overall query complexity is $Q = O(NP)$, which is at least of order $N^2$ if one considers $P \geq N$ (as we did in our experimental demonstration), i.e., quadratically worse in N than with the quantum N-switch.
A third possibility is to directly estimate the signs of the commutators between the $\Pi_x$, and from that infer y. A canonical tool for that is the well-known Hadamard test [36]. This allows one to estimate overlaps involving $\Pi_x|\psi\rangle_t$ directly from queries to $\Pi_x$, or to $\Pi_x$ and $\Pi_{x'}$, for any state $|\psi\rangle_t$. As before, each query to $\Pi_x$ accounts for N queries to the gates, and the overall query complexity is again $Q = O(NP)$.
Finally, one can simulate $S_N$ exactly with a circuit with fixed gate orders. For the usual case where all $P = N!$ permutations are considered, the optimal causally ordered circuit that synthesizes $S_N$ in the black-box scenario displays complexity $Q = \Theta(N^2)$ [13,14,17]. For the concrete case experimentally studied here, P = N = 4, the optimal causally ordered circuit that synthesizes $S_4$ requires 9 queries (see Appendix A 2). In fact, this is the reason why we chose the particular permutation set {ABCD, BADC, CBDA, DACB}. Through a brute-force search, we find that, among all quartets of permutations, most require 7 queries or fewer with the simulation strategy presented in Appendix A 2, others require 8 queries, and a few (including the one chosen here) require the maximum of 9 queries. Thus, the specific version of the quantum 4-switch process implemented here provides a gap of 9 − 4 = 5 queries with respect to all causally ordered processes.
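As a rough numerical illustration of the scalings discussed above, the following Python sketch tabulates the query counts of the quantum N-switch, of the strategies that work with the P gate products directly, and of gate-by-gate tomography. All constant prefactors are set to 1 and the function names are hypothetical; only the scalings N, NP, and N^3 are taken from the text.

```python
def switch_queries(n: int) -> int:
    """Quantum N-switch: each of the N gates is queried once."""
    return n


def product_queries(n: int, p: int) -> int:
    """Strategies acting on the P products directly: every query to one
    product costs N gate queries, so Q ~ N * P (prefactor assumed to be 1)."""
    return n * p


def gate_tomography_queries(n: int) -> int:
    """Gate-by-gate tomography: a per-gate error ~ 1/N needs q ~ N**2
    queries per gate (Hoeffding), and there are N gates, so Q ~ N**3."""
    q_per_gate = n ** 2  # assumption: prefactor set to 1
    return n * q_per_gate


if __name__ == "__main__":
    # Columns: N, switch, product-based (with P = N), gate tomography.
    for n in (4, 8, 16):
        print(n, switch_queries(n), product_queries(n, n), gate_tomography_queries(n))
```

The numbers are only meaningful relative to each other, since the true prefactors depend on the target error and confidence level.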

Single-photon source
The single-photon light source is composed of a semiconductor distributed-feedback telecom laser (λ = 1546 nm) connected to an external fiber-pigtailed amplitude modulator (Mach-Zehnder interferometer, MZI). An FPGA unit (FPGA1) is used with the MZI to externally modulate the laser and generate optical pulses 5 ns wide. Optical attenuators (ATTs) are used before the MZI to create weak coherent states with a mean photon number per pulse of μ = 0.2. In this case, 90% of the non-null pulses generated contain a single photon. Thus, our source is a good approximation to a nondeterministic single-photon source, which is commonly adopted in quantum communications [45]. FPGA1 also controls the active phase stabilization of the system and the registration of single-photon counts at each of the four detectors during the measurement procedure (see below).
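The quoted 90% figure can be checked with a short calculation, assuming (as is standard for weak coherent states) that the photon number per pulse follows a Poisson distribution with mean μ. The function name below is hypothetical.

```python
import math


def single_photon_fraction(mu: float) -> float:
    """Fraction of non-empty pulses that contain exactly one photon,
    assuming Poissonian photon statistics with mean photon number mu."""
    p_one = mu * math.exp(-mu)        # P(n = 1)
    p_nonnull = 1.0 - math.exp(-mu)   # P(n >= 1)
    return p_one / p_nonnull


if __name__ == "__main__":
    print(round(single_photon_fraction(0.2), 3))
```

For μ = 0.2 this ratio is slightly above 0.90, consistent with the value stated in the text.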

Indistinguishability of the multigate operations in different orders
The four unitary operators $U_i$ (i = A, B, C, D) are realized using birefringent liquid crystal retarders (LCRs). An important aspect of the experiment is to guarantee the realization of the same unitary operation $U_i$ for all different orders considered. That is, the implementation of $U_i$ must be independent of the illuminated core on the corresponding 4CF at the IN side of the oracle. To achieve this, the LCRs are placed in the Fourier plane of the objective lenses of the 4CF fiber launchers [see Fig. 4(a)]. At the exit face of this fiber, the output single mode of each core is given by a Gaussian function $g(\vec{r})$ centered at the core position $\vec{r}_c$. At the Fourier plane of the launcher lens, the spatial distribution of each core is given by the Fourier transform of this displaced Gaussian, i.e., by the transform $\tilde{g}(\vec{s})$ of the centered Gaussian multiplied by a linear phase factor that depends on $\vec{r}_c$. Therefore, irrespective of the illuminated core, all core modes overlap at the same central point with the intensity proportional to $|\tilde{g}(\vec{s})|^2$. This avoids the spatial distinguishability present in certain implementations for N = 2 gates [18,19]. To guarantee this condition for our experiment, we used a CCD camera to record the intensity distributions at the Fourier plane (with the LCRs removed), as shown in Fig. 4(b). The images, obtained with an intense laser, show the centering of the light distribution when a single core is connected. The resulting interference pattern when all cores are illuminated shows high visibility, confirming spatial indistinguishability. This guarantees that the unitary operations $U_i$ are indistinguishable when applied in different orders, a crucial requirement for a valid implementation of an N-switch [20].
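The key point of this argument is the Fourier shift theorem: displacing a mode only adds a linear phase in the Fourier plane, so the intensity there does not depend on the core position. The following one-dimensional sketch illustrates this with a discrete Fourier transform. The grid size, Gaussian width, and core positions are arbitrary values chosen here for illustration, not experimental parameters.

```python
import cmath
import math


def dft(samples):
    """Plain discrete Fourier transform (O(n^2), fine for a small grid)."""
    n = len(samples)
    return [
        sum(samples[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def gaussian_mode(n, center, width):
    """Gaussian mode profile sampled on n points, centered at 'center'."""
    return [math.exp(-((j - center) ** 2) / (2 * width ** 2)) for j in range(n)]


n = 64
mode_a = gaussian_mode(n, 24, 3.0)  # hypothetical core position 1
mode_b = gaussian_mode(n, 40, 3.0)  # hypothetical core position 2

# Intensities in the Fourier plane for each illuminated core.
intensity_a = [abs(v) ** 2 for v in dft(mode_a)]
intensity_b = [abs(v) ** 2 for v in dft(mode_b)]

# Largest pointwise difference between the two Fourier-plane intensities.
max_diff = max(abs(a - b) for a, b in zip(intensity_a, intensity_b))

if __name__ == "__main__":
    print(max_diff)
```

The two modes are clearly different at the fiber face, yet their Fourier-plane intensities agree up to numerical precision, which is the property that makes the LCR action independent of the illuminated core.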

Phase stabilization and measurement procedure
Phase (PHASE MOD) and intensity modulators (INT MOD) are used after the first 4CF-BS, on each arm of the interferometer [see Fig. 2(a)], to set the relative phases between the four spatial modes to zero, and to adjust the amplitudes. The FPGA1 unit is used to implement a control system to actively compensate phase drifts in the quantum 4-switch process. The control is based on a perturb-and-observe power-point-tracking method [39,46]. Basically, the phase-drift compensation algorithm perturbs the kth phase modulator with a high-speed signal to cancel any phase noise. The algorithm does this sequentially for each phase modulator, and in each step it maximizes the number of photocounts in the output detector "0" with the LCRs set to realize column y = 0 of one of the tables in Table I. When the counts reach a given threshold value for the success probability, the voltages applied to the phase modulators are held constant, and an ON signal is sent to FPGA2 to activate the LCRs by applying a constant voltage, realizing any one of the four columns of the respective table in Table I, chosen by the user. After a 0.2 s dead time to allow the LCR voltages to reach the desired value, a 0.1 s measurement stage is carried out. After a single measurement window, an OFF signal is sent to return the LCRs to column 0. In this way, we can switch rapidly between columns 0-3 of the tables. The control system monitors the phase stabilization of the interferometer in real time after every measurement.
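The sequential perturb-and-observe logic described above can be sketched on a toy model. The model below is an assumption made here for illustration and is not the FPGA implementation. It treats the interferometer as four equally weighted paths with static phase offsets, and takes the probability of a click in detector "0" to be the squared modulus of the average of the four phase factors. Step size, sweep count, and threshold are hypothetical values.

```python
import cmath


def detector0_probability(drifts, controls):
    """Toy model: detector-0 probability for equally weighted paths whose
    phases are the drift plus the control phase applied to each arm."""
    amp = sum(cmath.exp(1j * (d + c)) for d, c in zip(drifts, controls))
    return abs(amp / len(drifts)) ** 2


def perturb_and_observe(drifts, step=0.05, sweeps=500, threshold=0.99):
    """Perturb each control phase in turn and keep a perturbation only if it
    increases the detector-0 signal; stop once the threshold is reached."""
    controls = [0.0] * len(drifts)
    best = detector0_probability(drifts, controls)
    for _ in range(sweeps):
        if best >= threshold:
            break
        for k in range(len(controls)):
            for delta in (step, -step):
                trial = controls.copy()
                trial[k] += delta
                p = detector0_probability(drifts, trial)
                if p > best:  # observe: keep only improving perturbations
                    controls, best = trial, p
                    break
    return controls, best


if __name__ == "__main__":
    controls, best = perturb_and_observe([0.0, 1.0, -2.0, 2.5])
    print(controls, best)
```

In this toy model the loop plays the role of the stabilization stage only. The hand-over to the measurement stage (ON/OFF signals, dead time) is not modeled.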
We have used this phase stabilization routine in other work [39], obtaining visibilities over 99%. Here, our success probability is limited to about 95% due to slightly imperfect polarization rotations of the LCRs, as well as the difficulty in achieving proper alignment of the polarization state for the different LCR combinations in each path, which we observed in the initial alignment procedure using the polarimeter (see Sec. IV).

ACKNOWLEDGMENTS
This work is also supported by Fondo Nacional de Desarrollo Científico y Tecnológico (ANID) (Grants No. 3200779, No. 1190901, No. 1200266, and No. 1200859) and ANID - Millennium Science Initiative Program - ICN17_012. J.C. is supported by ANID/REC/PAI77190088. A.A. is supported by the Swiss National Science Foundation (Starting Grant DIAQ and NCCR SwissMAP). M.A. received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant No. 801110 and the Austrian Federal Ministry of Education, Science and Research (BMBWF). The paper reflects only the authors' views; the EU agency is not responsible for any use that may be made of the information it contains.

Proof of Eq. (5)
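First, note that (just like the Fourier transform) the Hadamard gate $H_P$ maps $|0\rangle_c$ to the uniform superposition of all computational-basis states (under the assumption that the corresponding Hadamard matrix $M_P$ only has +1 values along the first column):
$$H_P|0\rangle_c = \frac{1}{\sqrt{P}}\sum_{x\in[P]}|x\rangle_c. \quad \text{(A1)}$$
Then, the quantum N-switch gate introduces the sign $m_{x,y}$ to each computational-basis state $|x\rangle_c$ in the superposition:
$$S_N\big(H_P|0\rangle_c\otimes|\psi\rangle_t\big) = \frac{1}{\sqrt{P}}\sum_{x\in[P]}|x\rangle_c\otimes\Pi_x|\psi\rangle_t = \Big[\frac{1}{\sqrt{P}}\sum_{x\in[P]}m_{x,y}|x\rangle_c\Big]\otimes\Pi_0|\psi\rangle_t, \quad \text{(A2)}$$
where the second equality follows from Eq. (4). Now, by definition, the state within the brackets is $H_P|y\rangle_c$. Hence, applying $H_P^{-1}$ to both sides of Eq. (A2) yields Eq. (5).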
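The final step of the proof can be checked numerically for P = 4. The sketch below assumes the Sylvester construction for the Hadamard matrix, which has only +1 entries in its first row and column; this is an illustrative choice, not necessarily the matrix used in the experiment. It applies the inverse Hadamard gate to a control state whose amplitudes carry the signs of one column y.

```python
def sylvester(p):
    """Sylvester-type Hadamard matrix of order p (p a power of 2)."""
    m = [[1]]
    while len(m) < p:
        m = [row + row for row in m] + [row + [-v for v in row] for row in m]
    return m


def decode(signs, m):
    """Apply H_P^{-1} to the control state (1/sqrt(P)) sum_x signs[x] |x>.
    With H_P = M_P / sqrt(P) real and orthogonal, H_P^{-1} is its transpose,
    so amplitude y equals sum_x m[x][y] * signs[x] / P."""
    p = len(m)
    return [sum(m[x][y] * signs[x] for x in range(p)) / p for y in range(p)]


if __name__ == "__main__":
    m4 = sylvester(4)
    for y in range(4):
        column = [m4[x][y] for x in range(4)]  # signs m_{x,y} for this y
        print(y, decode(column, m4))
```

Each column of signs is mapped to the corresponding computational-basis state of the control, so a single measurement of the control reveals y.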

Exact simulation of the quantum N -switch with a fixed-gate-order circuit
It is possible to simulate the quantum N-switch, i.e., produce the same superposition of unitaries $\{\Pi_x\}_{x\in[P]}$ as the quantum N-switch for whatever unitaries $U_i$ are inserted at its open slots, with a causally ordered circuit at the cost of making more uses (queries) of each unitary. The basic idea behind such a circuit is to apply the unitaries coherently controlled by a qudit. However, this is not a straightforward task with black-box unitaries [26,47-51]. A workaround is to use ancillas and controlled swap gates that coherently control whether each target-system gate is effectively applied to the target system or to an ancilla. This can be done with a circuit such as in Fig. 5, which uses a P-dimensional control qudit and N d-dimensional ancilla systems (one for each gate $U_i$). Importantly, as the reader may verify, all N ancillas experience the same overall gate sequence for all input states of the control register, which guarantees that the ancillas disentangle from the target and control systems by the end of the circuit. For instance, for the circuit in Fig. 5, the final state of the ancillas is $U_A^2|0\rangle_{\mathrm{anc},A}\,U_B|0\rangle_{\mathrm{anc},B}\,U_C|0\rangle_{\mathrm{anc},C}\,U_D|0\rangle_{\mathrm{anc},D}$.
FIG. 5. Fixed-gate-order circuit that simulates the quantum 4-switch process that is realized experimentally, i.e., with quantum control of the four gate sequences $\Pi_0 = U_D U_C U_B U_A$, $\Pi_1 = U_C U_D U_A U_B$, $\Pi_2 = U_A U_D U_B U_C$, and $\Pi_3 = U_B U_C U_A U_D$. Before and after each unitary $U_i$, a pair of controlled swap gates controls whether $U_i$ is applied to the target system or to an ancilla; the control qudit has dimension P = 4, here represented as two qubits (with x = 0, 1, 2, and 3 encoded as 00, 01, 10, and 11, respectively). Filled circles indicate an operation conditioned on the $|1\rangle_c$ state; open circles indicate an operation conditioned on the $|0\rangle_c$ state. Conditioning on negation of certain states is also needed, as exemplified in the legend below the circuit.
With this circuit scheme, the problem of simulating the superposition of unitaries produced by a quantum N-switch reduces to finding a supersequence that includes all the desired permutations as subsequences; the query complexity of this scheme is then given by the length of the shortest such supersequence [17,52]. In the experiment and Fig. 5, ACBADACDB is the supersequence for the quartet of permutations {ABCD, BADC, CBDA, DACB} (note that the subsequences need not be contiguous). We have made an extensive numerical search of all quartets of permutations of A, B, C, D. There are $\binom{N!-1}{P-1} = \binom{23}{3} = 1771$ unique quartets, where quartets that differ only by relabeling are disregarded (this amounts to, for instance, only considering quartets that include some fixed permutation, e.g., ABCD). Of those, most require a supersequence of length 8 or less (37 unique quartets require length 6; 946 require length 7; 779 require length 8) and only 9 require length 9. Since a longer supersequence means a higher query complexity for the simulation by a fixed-gate-order circuit, we chose one of the latter nine quartets for our experiment (as well as Fig. 5). Note that all nine black boxes are queried once, irrespective of whether they are effectively used in the superposition or not; hence, the query complexity of this simulation of the quantum 4-switch process is 9.
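The supersequence picture lends itself to a direct check. The sketch below, with hypothetical helper names, verifies that ACBADACDB contains the four chosen permutations as (not necessarily contiguous) subsequences, counts the quartets that include ABCD, and computes the length of a shortest common supersequence by breadth-first search over the progress made in each permutation. It is an independent illustration, not the search code used for the paper.

```python
from collections import deque
from itertools import combinations, permutations

QUARTET = ["ABCD", "BADC", "CBDA", "DACB"]


def is_supersequence(sup, sub):
    """True if 'sub' appears in 'sup' as a (non-contiguous) subsequence."""
    it = iter(sup)
    return all(ch in it for ch in sub)


def scs_length(words):
    """Length of a shortest common supersequence of 'words'.
    A state records how many letters of each word are already covered;
    appending a letter advances every word whose next letter matches."""
    alphabet = sorted(set("".join(words)))
    start = tuple(0 for _ in words)
    goal = tuple(len(w) for w in words)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        if state == goal:
            return dist
        for ch in alphabet:
            nxt = tuple(
                i + 1 if i < len(w) and w[i] == ch else i
                for i, w in zip(state, words)
            )
            if nxt != state and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("no common supersequence found")


# Quartets that contain the fixed permutation ABCD: choose 3 of the other 23.
OTHERS = ["".join(p) for p in permutations("ABCD") if "".join(p) != "ABCD"]
NUM_QUARTETS = sum(1 for _ in combinations(OTHERS, 3))

if __name__ == "__main__":
    print(all(is_supersequence("ACBADACDB", w) for w in QUARTET))
    print(NUM_QUARTETS, scs_length(QUARTET))
```

Running `scs_length` over `["ABCD", *triple]` for every `triple` in `combinations(OTHERS, 3)` gives a tally of supersequence lengths that can be compared with the counts quoted above.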

Fixed-gate circuit algorithms for the Hadamard promise problem exploiting side information about the gates
Let us revisit the adversarial scenario of a verifier who controls the oracle and poses the Hadamard promise problem to a prover. The prover thus receives unitaries that are unknown to them and uses them to the best of their abilities to solve the problem and output the correct answer to the verifier. As we showed, a prover in possession of a quantum N-switch can solve the problem with 100% success rate using only a single query to each unitary. We now ask: can a prover solve the problem with access only to fixed-gate-order circuits?
By performing the simulations in the previous section, they are also able to solve the Hadamard promise problem with 100% success rate. However, they must request additional oracle queries from the verifier, a tell-tale sign to the latter that the quantum N-switch has not been realized.
We now explore the case of a prover with side information on the unitaries from the oracle. More specifically, let us suppose that they know the table of unitaries that the verifier uses (Table I), but not which column is selected in each run. This information aids the prover, who may no longer need to produce the superposition of unitaries from the previous section.
If Table I(a) is used, the prover's strategy is relatively simple. By inputting a $|+\rangle := (|0\rangle + |1\rangle)/\sqrt{2}$ state to black box $U_A$, the output state will be either $|+\rangle$ or the orthogonal state $|-\rangle$, depending on which unitary $U_A$ implements. With a measurement of the output in the X basis, they can identify $U_A$ (we call this an X-basis test on $U_A$). Doing the same procedure on $U_C$, they identify this unitary as well and discover the column y of Table I(a) being used. Since at most one query to each unitary is needed, the prover can in fact deceive the verifier in this case.
If instead Table I(b) is used, the prover requires a slightly more complex fixed-gate-order circuit to deceive the verifier. It begins with an X-basis test applied to $U_C$, which reveals the content of that black box. In turn, $U_D$ is revealed with an analogous Z-basis test, with input state $|0\rangle$ and measurement of the output in the Z basis. If one of these two black boxes is revealed to be a Pauli operator (Z or X), then that run of the promise problem has been solved (y = 1 or 3, respectively). However, if both $U_C = \mathbb{1}$ and $U_D = \mathbb{1}$, both y = 0 and y = 2 are possible, and the black boxes $U_A$, $U_B$ need to be used. Since the quantum N-switch finds the correct value of y with probability 1, the prover aims to do the same here. However, the two possible unitaries for $U_A$ are not orthogonal, i.e., not perfectly distinguishable, and the same happens with $U_B$. No independent use of $U_A$ and $U_B$ can tell the columns apart with certainty. There is a viable strategy, though, using $U_A$ and $U_B$ in sequence. Indeed, note that $U_B U_A = \mathbb{1}$ for column 0 and $U_B U_A = -iY$ for column 2. A Z- or X-basis test applied to the sequence of the two unitaries $U_A$ and $U_B$ can distinguish these two possibilities, again solving the problem with certainty.
If the prover does not know whether the verifier uses Table I(a) or I(b), the prover first needs to identify which table is used. This table identification can be done with a Z-basis test on $U_D$, which reveals whether $U_D = X$ or $U_D = \mathbb{1}$. The strategy for Table I(a) is applied in the former case, that for Table I(b) in the latter case (note that column y = 3 is the same for both tables).
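The basis tests used in these strategies rely on simple facts about Pauli operators, which can be checked directly. The sketch below, with hypothetical helper names, computes the probability that the test returns the input state, for the cases mentioned in the text: identity versus Z in an X-basis test, identity versus X in a Z-basis test, and identity versus −iY in either test.

```python
import math

IDENTITY = [[1, 0], [0, 1]]
Z = [[1, 0], [0, -1]]
X = [[0, 1], [1, 0]]
MINUS_I_Y = [[0, -1], [1, 0]]  # the matrix -iY

PLUS = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # input of the X-basis test
ZERO = [1.0, 0.0]                            # input of the Z-basis test


def apply(u, psi):
    """Apply a 2x2 matrix to a qubit state vector."""
    return [u[0][0] * psi[0] + u[0][1] * psi[1],
            u[1][0] * psi[0] + u[1][1] * psi[1]]


def return_probability(u, psi):
    """Probability of finding the output u|psi> in the input state |psi>."""
    out = apply(u, psi)
    overlap = sum(complex(a).conjugate() * b for a, b in zip(psi, out))
    return abs(overlap) ** 2


def x_basis_test(u):
    return return_probability(u, PLUS)


def z_basis_test(u):
    return return_probability(u, ZERO)


if __name__ == "__main__":
    print(x_basis_test(IDENTITY), x_basis_test(Z))
    print(z_basis_test(IDENTITY), z_basis_test(X))
    print(x_basis_test(MINUS_I_Y), z_basis_test(MINUS_I_Y))
```

A return probability of 1 versus 0 means the two candidate unitaries map the test state to orthogonal outputs, so a single query distinguishes them with certainty.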

Causal witnesses for the 4-switch process
In order to certify, via the Hadamard promise problem, that a given process exhibits some QCGO, one may look for the maximal probability of success $p^{\mathrm{CCGO}}_{\mathrm{succ}}$ that processes with CCGO can reach: if this upper bound is strictly smaller than 1, it becomes possible to experimentally obtain a probability of success $p_{\mathrm{succ}} > p^{\mathrm{CCGO}}_{\mathrm{succ}}$ and thus prove that these results cannot be explained by CCGO.
For a fixed choice of gate permutations and of the Hadamard matrix under consideration, the "causal bound" $p^{\mathrm{CCGO}}_{\mathrm{succ}}$ still depends on the specific choice of possible sets $\mathcal{U}$, and of the prior distribution with which each set is chosen in each experimental run. Considering different possible sets $\mathcal{U}_k$, each satisfying the promise of Eq. (4) for some value $y = y_k$ and chosen with probability $q_k$, the probability of success (i.e., of obtaining the correct value $y = y_k$) of the Hadamard promise problem is obtained as
$$p_{\mathrm{succ}} = \sum_k q_k\,\mathrm{Prob}(y = y_k \mid \mathcal{U} = \mathcal{U}_k).$$
To compute the above probabilities, and to obtain the causal bound $p^{\mathrm{CCGO}}_{\mathrm{succ}}$, we use the so-called "process matrix framework" [1]. In this framework the process under consideration (i.e., in our case, the circuit that connects the four unitaries and the final measurement) is described by the "process matrix" W, acting on the tensor product of all input and output Hilbert spaces of the four unitaries and of the final measurement. When the four qubit unitaries from some quartet $\mathcal{U}_k = \{U_A^{(k)}, U_B^{(k)}, U_C^{(k)}, U_D^{(k)}\}$ are applied, the probability $\mathrm{Prob}(y = y_k \mid \mathcal{U} = \mathcal{U}_k)$ that the final measurement in the computational basis $\{|y\rangle_c\}_{y\in[4]}$ of $\mathcal{H}_c$ gives the outcome $y_k$ for an arbitrary process matrix W is obtained as
$$\mathrm{Prob}(y = y_k \mid \mathcal{U} = \mathcal{U}_k) = \mathrm{Tr}\big[\big(|\mathcal{U}_k\rangle\rangle\langle\langle\mathcal{U}_k| \otimes |y_k\rangle\langle y_k|^{c}\big)W\big],$$
where $|\mathcal{U}_k\rangle\rangle$ is the vector that represents the four unitaries of $\mathcal{U}_k$ in this framework. The success probability can therefore be written as $p_{\mathrm{succ}} = \mathrm{Tr}[GW]$, with
$$G := \sum_k q_k\,|\mathcal{U}_k\rangle\rangle\langle\langle\mathcal{U}_k| \otimes |y_k\rangle\langle y_k|^{c}. \quad \text{(A6)}$$
In the process matrix describing the ideal 4-switch process of Fig. 1(b), the superscripts indicate the Hilbert spaces in which the various states are defined: $c_p$, $c_f$ refer to the past and the future of the control system, $t_p$, $t_f$ refer to the past and the future of the target system, $A_I$ and $A_O$ refer to the input and output spaces of operation $U_A$, and similarly for the other parties. Note that, for the sake of clarity, in Fig. 1(a) we use a simplified notation based on the necessary isomorphism between $t_p$, $t_f$, $A_I$, $A_O$, and the other parties' inputs and outputs (as well as between $c_p$ and $c_f$).
In Algorithm III we input the initial control state $H_P|0\rangle$ into $c_p$, the initial target state $|\psi\rangle$ into $t_p$, and apply $H_P^{-1}$ to the resulting state of the control system in $c_f$. These fixed steps can be incorporated into the process-matrix description. The resulting matrix that describes our effective process is then $W_4 = \mathrm{Tr}_{t_f}|w_4\rangle\rangle\langle\langle w_4|$, where the process vector $|w_4\rangle\rangle$ incorporates these fixed steps. Using this process matrix, we can verify that, for any $\mathcal{U}_k$ satisfying the promise of Eq. (4) for some $y = y_k$, one has $\mathrm{Tr}[(|\mathcal{U}_k\rangle\rangle\langle\langle\mathcal{U}_k| \otimes |y_k\rangle\langle y_k|^{c})W_4] = 1$, so that the success probability of Algorithm III, using the 4-switch process, is indeed unity.
Processes that display CCGO, on the other hand, are described by process matrices from a particular subset of all possible matrices, with some specific structure. In Ref. [54], it was indeed shown that (in our scenario, with four operations and a final measurement) CCGO process matrices W must have a decomposition of the form
$$W = \sum_{(i,j,k,l)} W_{(i,j,k,l),c} \quad \text{(A9)}$$
in terms of positive semidefinite matrices (not necessarily valid process matrices) $W_{(i,j,k,l),c}$, for all 4! = 24 permutations $(i, j, k, l)$ of {A, B, C, D} (hence with $i \neq j \neq k \neq l$). These must furthermore be such that the "reduced" matrices $W_{(i,j,k,l)} := \mathrm{Tr}_c W_{(i,j,k,l),c}$ (where c refers to the space of the final measurement), $W_{(i,j,k)} := \mathrm{Tr}_l W_{(i,j,k,l)}$ (where $\mathrm{Tr}_l$ corresponds to the partial trace over the input and output spaces of the operation $U_l$), $W_{(i,j)} := \sum_k \mathrm{Tr}_k W_{(i,j,k)}$, and $W_{(i)} := \sum_j \mathrm{Tr}_j W_{(i,j)}$ are of the form
$$W_{(i,j,k,l)} = \tilde{W}_{(i,j,k,l)} \otimes \mathbb{1}^{l_O}, \quad W_{(i,j,k)} = \tilde{W}_{(i,j,k)} \otimes \mathbb{1}^{k_O}, \quad W_{(i,j)} = \tilde{W}_{(i,j)} \otimes \mathbb{1}^{j_O}, \quad W_{(i)} = \tilde{W}_{(i)} \otimes \mathbb{1}^{i_O} \quad \text{(A10)}$$
for some matrices $\tilde{W}_{(\cdot)}$ in the appropriate spaces. Here $\mathbb{1}^{l_O}$ denotes the identity operator on the output space of the operation $U_l$, and similarly for $\mathbb{1}^{k_O}$, $\mathbb{1}^{j_O}$, and $\mathbb{1}^{i_O}$.
To obtain the causal bound $p^{\mathrm{CCGO}}_{\mathrm{succ}}$ for all CCGO processes (for a fixed choice of sets $\mathcal{U}_k$ and weights $q_k$, and hence a fixed operator G as defined in Eq. (A6)), one can then optimize the value of $p_{\mathrm{succ}} = \mathrm{Tr}[GW]$ for all W in the class described by Eqs. (A9)-(A10) (which describes a closed convex cone, which we denote by $\mathcal{W}_{\mathrm{CCGO}}$) and with the additional normalization condition [1,3,55] that $\mathrm{Tr}\,W = 2^4$:
$$p^{\mathrm{CCGO}}_{\mathrm{succ}} = \max_{W}\ \mathrm{Tr}[GW] \quad \text{such that } W \in \mathcal{W}_{\mathrm{CCGO}},\ \mathrm{Tr}\,W = 2^4. \quad \text{(A11)}$$
As it turns out, this optimization is a semidefinite programming (SDP) problem, which can in principle be solved faithfully [2,44,55]. Another possible "dual" approach, now just for a fixed choice of possible sets $\mathcal{U}_k$, is to optimize the causal witness rather than the process matrix. Fixing the witness to be of the form of G in Eq. (A6) allows us to optimize the weights $q_k$ for each $\mathcal{U}_k$: indeed, the optimization problem can be written here (see Appendix H of Ref. [2]) as an SDP over the weights $q_k$ that involves $\mathcal{W}^*_{\mathrm{CCGO}}$, the convex cone dual to $\mathcal{W}_{\mathrm{CCGO}}$, which can, like the latter, be described in terms of SDP constraints [2,44,55]. Let us note here that the above characterization of $\mathcal{W}_{\mathrm{CCGO}}$ [via the decomposition of Eq. (A9), with the matrices $\tilde{W}_{(\cdot)}$ satisfying the constraints of Eq. (A10)] was shown [55] to be a sufficient condition for the process matrix to be "causally separable" [1,3,55]. It remains an open question whether or not the class of causally separable processes is strictly larger than that of CCGO. We nevertheless conjecture that the "causal bounds" $p^{\mathrm{CCGO}}_{\mathrm{succ}}$ we obtain here hold for all causally separable processes.

a. Causal witnesses with finitely many settings
We are able to solve the simpler SDP problem of Eq. (A16) using the 460 sets of unitaries from $\mathcal{G}$ with the splitting conic solver [56,57], obtaining a bound of $p^{\mathrm{CCGO}}_{\mathrm{succ}} \approx 0.92$. We then progressively set to zero the smallest weights and solve the SDP again, eventually reaching 60 nonzero weights with no change in $p^{\mathrm{CCGO}}_{\mathrm{succ}}$ within numerical precision (36 corresponding to sets satisfying the promise for y = 0, 12 for y = 1, and 6 each for y = 2 and y = 3).

b. Causal witnesses with random rotations
The causal strategies described in Appendix A 3 exploit knowledge of the basis that the unknown unitaries are defined in. A possibility to obtain better bounds on $p^{\mathrm{CCGO}}_{\mathrm{succ}}$ is therefore to allow the verifier to provide the unitaries in an unknown basis. Given a set $\mathcal{U} = \{U_A, U_B, U_C, U_D\}$, this corresponds formally to providing the operations $\mathcal{U}^{(V)} = \{VU_AV^\dagger, VU_BV^\dagger, VU_CV^\dagger, VU_DV^\dagger\}$ for some unknown unitary V. Note that if $\mathcal{U}$ obeys the promise of Eq. (4) then so does $\mathcal{U}^{(V)}$.
To construct better causal witnesses from this approach, we start as before with a fixed choice of sets $\mathcal{U}_k$ and then, in addition to choosing $\mathcal{U}_k$ with prior probability $q_k$, we randomly choose an unknown unitary V to be applied according to the Haar measure. Equation (A6) then becomes
$$G = \sum_k q_k \int d\mu(V)\,|\mathcal{U}_k^{(V)}\rangle\rangle\langle\langle\mathcal{U}_k^{(V)}| \otimes |y_k\rangle\langle y_k|^{c},$$
where $\mu(V)$ is the normalized Haar measure over SU(2). SDP problems (A11), (A12), and (A15) can then be solved in the same way as described above. We again consider the 460 sets of unitaries $\mathcal{U}$ with each $U_i \in \mathcal{G}$ as in the previous section. The integration over the Haar measure can be performed analytically by taking an explicit parameterization of SU(2) unitaries. However, since the $|\mathcal{U}_k^{(V)}\rangle\rangle\langle\langle\mathcal{U}_k^{(V)}|$ are $2^8 \times 2^8$ matrices, this procedure is nevertheless slow, even with automated symbolic integration using, e.g., Mathematica. To simplify matters, we exploit the fact that the Haar measure is unitarily invariant [i.e., $d\mu(V) = d\mu(UV) = d\mu(VU)$ for any unitary U], so sets $\mathcal{U}$ and $\mathcal{U}'$ that are equivalent up to a change of basis give the same integral $\int d\mu(V)\,|\mathcal{U}^{(V)}\rangle\rangle\langle\langle\mathcal{U}^{(V)}|$. We thereby find that there are 98 sets $\mathcal{U}$ that are inequivalent in this way and that satisfy the promise for one of the values $y_k$.
Considering witnesses constructed from these sets, we solved the dual SDP problem given in Eq. (A16). For CCGO processes, we find the bound $p^{\mathrm{CCGO}}_{\mathrm{succ}} \approx 0.84$. Interestingly, we find that the same bound can be reached by considering the Haar randomization only over witnesses using sets $\mathcal{U}$ containing only Pauli matrices, rather than from the full set $\mathcal{G}$. Indeed, this bound can be obtained by randomizing over the 30 sets $\mathcal{U}$ given in Table II that we found to have nonzero weights in the optimal witness we obtained.
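The statement that conjugation by V preserves the promise can be illustrated numerically. The sketch below takes a set of four Pauli gates, computes the sign by which each of the four gate sequences ABCD, BADC, CBDA, and DACB differs from the first one, and checks that the same signs are obtained after conjugating every gate with the same unitary V. The helper names are hypothetical, and the unitary is drawn from a simple angle parameterization of SU(2), which is an arbitrary choice made here and is not Haar distributed.

```python
import cmath
import math
import random

ORDERS = ["ABCD", "BADC", "CBDA", "DACB"]  # gate applied first is leftmost
Z = [[1, 0], [0, -1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def product(gates, order):
    """Overall unitary when the gates are applied in the given order."""
    out = [[1, 0], [0, 1]]
    for label in order:
        out = matmul(gates[label], out)
    return out


def relative_signs(gates):
    """Signs s_x with Pi_x = s_x * Pi_0, read off from Tr(Pi_0^dagger Pi_x)/2."""
    ref = product(gates, ORDERS[0])
    signs = []
    for order in ORDERS:
        t = matmul(dagger(ref), product(gates, order))
        signs.append(round(complex((t[0][0] + t[1][1]) / 2).real))
    return signs


def arbitrary_su2(rng):
    """Some SU(2) matrix from random angles (not Haar distributed)."""
    t = rng.uniform(0, math.pi / 2)
    a = math.cos(t) * cmath.exp(1j * rng.uniform(0, 2 * math.pi))
    b = math.sin(t) * cmath.exp(1j * rng.uniform(0, 2 * math.pi))
    return [[a, -b.conjugate()], [b, a.conjugate()]]


def conjugate_set(gates, v):
    """Replace every gate U by V U V^dagger."""
    vd = dagger(v)
    return {k: matmul(matmul(v, u), vd) for k, u in gates.items()}


if __name__ == "__main__":
    paulis = {"A": Z, "B": X, "C": Y, "D": Y}
    v = arbitrary_su2(random.Random(0))
    print(relative_signs(paulis), relative_signs(conjugate_set(paulis, v)))
```

The signs agree because every gate sequence is conjugated as a whole, so the relation between any two sequences is unchanged by V.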

c. Derandomization
In order not to require the assumption that the prover does not know in which basis the verifier provided each set $\mathcal{U}$, one could derandomize the approach above by using a weighted quantum t-design [58]. This is a finite set of unitaries $\mathcal{X}$ together with weights w such that the average of any operator over them is equal to its average over the Haar measure up to order t. Since $|\mathcal{U}_k^{(V)}\rangle\rangle\langle\langle\mathcal{U}_k^{(V)}|$ is an eighth-order expression in V, an 8-design allows us to reproduce exactly the witness with bound $p^{\mathrm{CCGO}}_{\mathrm{succ}} \approx 0.84$ with a finite, fixed set of unitaries. Unfortunately, t-designs are rather large. It can be shown that the smallest size $|\mathcal{X}|$ of a weighted 8-design for unitaries of dimension 2 is bounded by $165 \leq |\mathcal{X}| \leq 968$, so the resulting witness would have at least 165 × 30 = 4950 settings, and therefore be of little relevance for experiments [59].
In order to obtain smaller witnesses, we instead sampled five random qubit unitaries from the Haar measure, and conjugated all 30 columns of Table II with these fixed unitaries, obtaining a witness using 150 settings.
Solving the SDP in Eq. (A11) with the splitting conic solver, fixing the prior probability of choosing each set $\mathcal{U}_k^{(V)}$ to be $q_k/5$, where $q_k$ is the optimal weight obtained for the full randomization of $\mathcal{U}_k$ in the previous section, we obtained $p^{\mathrm{CCGO}}_{\mathrm{succ}} \approx 0.96$. By further optimizing over all 150 weights using the dual SDP, we find that this can be improved to $p^{\mathrm{CCGO}}_{\mathrm{succ}} \approx 0.93$. By using more than five random unitaries, this bound can be lowered further. For example, with 10 random unitaries we are able to obtain $p^{\mathrm{CCGO}}_{\mathrm{succ}} \approx 0.89$ (when optimizing over all 300 weights).

TABLE II. Table of 30 sets U (one set per column; 1 denotes the identity).
U_A   1 1 1 Z 1 1 Z 1 Z Z 1 Z Z Z Z Z 1 Z Z Z Z Z 1 Z Z Z 1 Z Z Z
U_B   1 1 Z 1 1 Z 1 Z 1 Z Z 1 Z Z Z X Z 1 Z X X X Z 1 X Z 1 X X X
U_C   1 Z 1 1 Z 1 1 Z Z 1 Z Z 1 Z Z X X 1 X Z X Y X Z 1 X Z 1 Z Y
U_D   Z 1 1 1 Z Z Z 1 1 1 Z Z Z 1 Z Z 1 X X X Y Z Z X 1 Y X X 1 Y