Event-Ready Bell Test Using Entangled Atoms Simultaneously Closing Detection and Locality Loopholes

An experimental test of Bell's inequality allows ruling out any local-realistic description of nature by measuring correlations between distant systems. While such tests are conceptually simple, there are strict requirements concerning the detection efficiency of the involved measurements, as well as the enforcement of spacelike separation between the measurement events. Only very recently could both loopholes be closed simultaneously. Here we present a statistically significant, event-ready Bell test based on combining heralded entanglement of atoms separated by $398\,\mathrm{m}$ with fast and efficient measurements of the atomic spin states closing essential loopholes. We obtain a violation with $S=2.221\pm0.033$ (compared to the maximal value of 2 achievable with models based on local hidden variables) which allows us to refute the hypothesis of local-realism with a significance level $P<2.57\cdot10^{-9}$.

PACS numbers: 03.65.Ud, 32.80.Qk
Back in 1935 Einstein, Podolsky and Rosen (EPR) pointed at inconsistencies in quantum mechanics that arise if one requires a physical theory to be realistic and local [1]. In such theories any signal, influence, or interaction propagates at most at the speed of light (locality), and one can assign properties to quantum systems before a measurement (realism). To achieve the latter, they left open the possibility of complementing quantum mechanics (QM) with what are nowadays called local hidden variables (LHV). Starting from the EPR example of analyzing the measurement results of two independent observers, John Bell showed that the predictions of QM for certain measurement scenarios differ from the predictions of all local, realistic theories [2]. With this he directly provided a prescription for how to evaluate the validity of the EPR claims and of any LHV theory in an experiment.
However, there are stringent requirements on an experimental test, as LHVs give a theory an amazing flexibility to account for observed results. Many experiments were started soon after Bell's discovery (e.g., [3,4]) and (almost) all agreed well with QM, but they all relied on assumptions about the observers or the observed systems, thus opening loopholes for the LHV theories under test (for reviews see, e.g., [5-7]).
One loophole, the locality loophole, concerns the independence of the observers, which can only be warranted if the whole measurement processes of the two observers are spacelike separated. This was achieved by Weihs et al. [8], where the whole measurement, starting from the choice of a random number up to the appearance of the classical voltage signal of a single photon detection, was outside the light cone of the other measurement. However, as detection of single photons was notoriously inefficient in those days, one had to assume fair sampling, i.e., that the registered photon pairs had been a representative sample of all pairs, thus leaving open the so-called detection loophole. This was closed for the first time in an experiment using trapped, entangled ions [9], which, however, were separated by only a few micrometers, leaving the locality loophole open. Since then the goal was to close both in a single experiment, leading to key developments such as the first observations of atom-photon entanglement [10,11] and atom-atom entanglement over larger distances [12,13]. Recently, based on electron spins of separated nitrogen-vacancy (NV) centers [14], the first experimental test of Bell's theorem without the locality and detection loopholes was performed [15]. With the development of efficient photon pair sources [16] and highly efficient single photon detectors [17], two tests also succeeded with entangled photon pairs [18,19].
Here we describe the evaluation of LHV theories using entangled neutral atoms, closing both the locality and the detection loophole in a single experiment. Based on atom-photon entanglement, entanglement swapping [21] allowed us to prepare, in a heralded manner, entangled spin states of two atoms separated by a distance of 398 m, well suited for an event-ready test. For an event-ready test no fair sampling assumption has to be made [21,22]. There, a measurement result is reported every time the heralding signal confirming the successful distribution of entanglement to the observers is obtained, and thus no detection loophole is opened at all. Any inefficiencies or inaccuracies in the atomic state detection then only influence the degree of achievable correlations. The locality loophole is closed by employing fast and efficient measurements of the atomic spin states at a sufficient distance, together with fast quantum random number generators (QRNG) for selection of the measurement basis. We employed state-dependent ionization for highly efficient atomic state analysis, and with a total observation time of about a microsecond the spacelike separation could also be warranted. Well-defined hypothesis tests with samples of 10000 observations clearly indicate that LHV theories do not allow a correct description of nature.
arXiv:1611.04604v2 [quant-ph]
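We consider the simplest situation of an event-ready Bell test, where two separate observers are told, according to a heralding signal, to report the result of two-outcome measurements $A, B \in \{\uparrow, \downarrow\}$ performed on each side (an example are measurements on spin-$\frac{1}{2}$ particles). For a test of local realism the two observers choose their measurement directions from two possibilities $a \in \{\alpha, \alpha'\}$ and $b \in \{\beta, \beta'\}$ and afterwards compare their results. For this situation Clauser, Horne, Shimony, and Holt (CHSH) put Bell's inequality in an experimentally friendly form [23]:
$$S = |\langle\sigma_\alpha\sigma_\beta\rangle + \langle\sigma_\alpha\sigma_{\beta'}\rangle| + |\langle\sigma_{\alpha'}\sigma_\beta\rangle - \langle\sigma_{\alpha'}\sigma_{\beta'}\rangle| \le 2, \qquad (1)$$
where the correlators are $\langle\sigma_a\sigma_b\rangle = \frac{1}{N_{a,b}}\left(N^{\uparrow\uparrow}_{a,b} + N^{\downarrow\downarrow}_{a,b} - N^{\uparrow\downarrow}_{a,b} - N^{\downarrow\uparrow}_{a,b}\right)$. Here $N^{A,B}_{a,b}$ denote the number of events with the respective outcomes $A, B$ for measurement directions $a, b$, and $N_{a,b}$ is the total number of events of the respective measurement setting. Quantum mechanics predicts a violation of this inequality when measurements are performed on maximally entangled states $|\Psi^\pm\rangle = \frac{1}{\sqrt{2}}(|\uparrow\rangle|\downarrow\rangle \pm |\downarrow\rangle|\uparrow\rangle)$ with certain measurement settings, e.g., $\alpha = 0^\circ$, $\alpha' = 90^\circ$, $\beta = -45^\circ$, $\beta' = 45^\circ$. The angles $\alpha, \beta$ are defined here in the spin space.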
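As a minimal sketch of how the CHSH parameter can be evaluated, the following Python snippet computes a correlator from four outcome counts and the quantum prediction for the quoted settings. It assumes the common CHSH combination $S = |E(\alpha,\beta) + E(\alpha,\beta')| + |E(\alpha',\beta) - E(\alpha',\beta')|$ and the textbook singlet correlation $E(a,b) = -\cos(a-b)$ for angles in spin space; the function names are illustrative and not taken from the experiment's analysis code.

```python
import math

def correlator_from_counts(n_uu, n_dd, n_ud, n_du):
    """Correlator for one setting pair from the four outcome counts
    (up-up, down-down, up-down, down-up)."""
    total = n_uu + n_dd + n_ud + n_du
    if total == 0:
        raise ValueError("no events for this setting")
    return (n_uu + n_dd - n_ud - n_du) / total

def singlet_correlator(a_deg, b_deg):
    """Textbook quantum prediction for |Psi^->: E(a, b) = -cos(a - b),
    with both angles given in spin space (assumption of this sketch)."""
    return -math.cos(math.radians(a_deg - b_deg))

def chsh_s(E, alpha, alpha_p, beta, beta_p):
    """CHSH parameter from a correlator function E(a, b)."""
    return (abs(E(alpha, beta) + E(alpha, beta_p))
            + abs(E(alpha_p, beta) - E(alpha_p, beta_p)))

# Settings quoted in the text: alpha = 0, alpha' = 90, beta = -45, beta' = 45 (degrees)
s_qm = chsh_s(singlet_correlator, 0.0, 90.0, -45.0, 45.0)
print(f"ideal quantum prediction S = {s_qm:.4f}")
```

For an ideal singlet state these settings give $S = 2\sqrt{2}$, the maximal quantum value, so any measured $S$ between 2 and $2\sqrt{2}$ reflects imperfections in state preparation and readout.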
In our case the two observer stations are independently operated setups (trap 1 and trap 2) that are equipped with their own laser and control systems. Their separation of 398 m (Fig. 1) makes 1328 ns available to warrant spacelike separation of the measurements. On each side we store a single $^{87}$Rb atom in an optical dipole trap. The employed internal spin states ($|\uparrow\rangle_z$ and $|\downarrow\rangle_z$) are the Zeeman states $|m_F = +1\rangle$ and $|m_F = -1\rangle$ of the ground level $5^2S_{1/2}, F = 1$ (Fig. 2(a)). Entanglement of the atoms is generated by first entangling the spin of each atom with the polarization of a single emitted photon [11]. The photons are guided to an interferometric Bell state measurement (BSM) setup (Fig. 2), located close to trap 1. It consists of a fiber beam splitter (BS) followed by polarizing beam splitters (PBS) in each of the output ports, where detection of photons is performed by four avalanche photodiodes (APDs). This setup allows two maximally entangled photon states to be distinguished. Thereby a two-photon coincidence in particular detector combinations (see Sec. I.B of the Supplemental Material [24], which includes Refs. [25-31]) heralds the projection of the atoms onto one of the states $|\Psi^\pm\rangle$.
Figure 2. (a) Scheme of the atomic levels involved in the entanglement between the spin state of the atom (subspace $5^2S_{1/2}, F = 1, |m_F = \pm1\rangle$) and the polarization of the photon (left- and right-circular, $|L\rangle$, $|R\rangle$, respectively). Entanglement is generated in the spontaneous decay of the $5^2P_{3/2}, F' = 0$ state after optical excitation. (b) Scheme of the atomic state measurement. A selected superposition of the spin states is excited to the $5^2P_{1/2}, F' = 1$ level depending on the polarization of a 795 nm laser pulse and is ionized with a 473 nm laser. The atom can spontaneously decay to the $5^2S_{1/2}, F = 1$ or $F = 2$ levels during this procedure (gray wavy arrows). While decays into the $F = 1$ level can reduce the fidelity of the measurement process, population in the $F = 2$ level is excited with an additional 780 nm laser and ionized as well. (c) Schematic of the experimental setup. In each trap spin-polarization entanglement is generated between the atom and a single photon which is guided to the BSM via a single-mode fiber. Polarization stability in the 700 m fiber connecting trap 2 and the BSM is ensured by automatic compensation [32] performed every 5 min using reference light and a polarization controller. The photons are overlapped on a fiber beam splitter (BS); their coincident detection heralds entanglement of the atomic spins. Local measurements are performed on the atomic spins according to settings selected by quantum random number generators (QRNGs). AOM: acousto-optic modulator, APD: avalanche photo diode, CEM: channel electron multiplier, FPGA: field programmable gate array, PBS: polarizing beam-splitter.
loaded into the traps. Photons emitted by the atoms are coupled into optical fibers. The efficiencies for detecting a single photon in the BSM arrangement after excitation in trap 1 or trap 2 are $\eta_1 = 1.65\times10^{-3}$ and $\eta_2 = 0.85\times10^{-3}$ (the latter also includes the transmission loss of photons ($\lambda = 780$ nm) in the 700 m fiber of approximately 50%). This results in an overall probability to obtain a heralding signal in the BSM of $0.7\times10^{-6}$. If no signal is obtained the excitation sequence of the atoms is repeated. Including the times necessary for transmission of signals as well as for preparing and cooling the atoms, the average rate of excitation attempts is $5.2\times10^{4}\,\mathrm{s}^{-1}$. Depending on the loading rate of the traps this results in about 1-2 heralding events per minute. The atom excitation procedures are synchronized to < 1 ns (Supplemental Material [24] Sec. I.A) such that the emitted photons entangled with the respective atoms have, at the BSM setup, a temporal overlap close to unity [13]. After a successful BSM, signals are sent to both observers where they trigger the switching to atomic state measurement. An additional waiting time has to be introduced due to dephasing and rephasing of atomic states in strongly focused dipole traps. There, longitudinal field components lead to an inhomogeneous light polarization which results in a state- and position-dependent AC Stark shift. Due to the antisymmetry of the polarization distribution this accumulated phase is compensated after one transverse oscillation [33]. To obtain simultaneous rephasing the radial trap frequencies are chosen, by setting the trap depths, for an oscillation period $2\pi/\omega_r$ of 11.2 µs and 14.5 µs for trap 1 and trap 2, respectively. The measurement procedure starts with selecting the analysis direction according to the output of a fast quantum random number generator. As a further development of [34] these QRNGs have minimal bias (typically less than $10^{-5}$) without any postprocessing [35].
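The quoted heralding probability and event rate can be cross-checked with a short calculation. The sketch below is illustrative only: the factor `bsm_fraction = 0.5` is an assumption (motivated by the BSM distinguishing two Bell states) chosen because it reproduces the quoted $0.7\times10^{-6}$; it is not stated as such in the text.

```python
# Single-photon detection efficiencies quoted in the text
eta_1 = 1.65e-3   # trap 1 -> BSM
eta_2 = 0.85e-3   # trap 2 -> BSM (includes ~50% loss in the 700 m fiber)

# ASSUMPTION of this sketch: only half of the two-photon events yield a
# heralding coincidence, since the BSM distinguishes two Bell states.
bsm_fraction = 0.5

p_herald = eta_1 * eta_2 * bsm_fraction          # probability per excitation attempt
attempt_rate = 5.2e4                             # excitation attempts per second (quoted)
events_per_minute = p_herald * attempt_rate * 60.0

print(f"heralding probability per attempt: {p_herald:.2e}")
print(f"heralding events per minute (both traps loaded): {events_per_minute:.2f}")
```

The resulting rate applies only while both traps hold an atom; time spent reloading the traps lowers the observed average, which is consistent with the quoted 1-2 events per minute.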
The random bit in trap 1 (trap 2) determining the direction $\alpha/\alpha'$ ($\beta/\beta'$) is provided on request and has no measurable correlation to bits generated earlier than 80 ns before; see the Supplemental Material [24] Sec. II for details. In the sense of independence from previous information, we thus consider this moment before the request as the starting time of the measurement.
For the analysis of the atomic state a state-selective ionization is employed, where the measurement direction $\gamma \in \{\alpha, \alpha', \beta, \beta'\}$ is determined by the polarization of a readout laser at 795 nm exciting the atom to the $5^2P_{1/2}, F' = 1$ level, from where it is ionized by an additional laser at 473 nm (Fig. 2(b)). In particular, we ionize the state $|\uparrow\rangle_\gamma = \sin(\gamma/2)|\uparrow\rangle_x - \cos(\gamma/2)|\downarrow\rangle_x$ using linear polarization at an angle $\gamma/2$ relative to the horizontal. The state $|\downarrow\rangle_\gamma = \cos(\gamma/2)|\uparrow\rangle_x + \sin(\gamma/2)|\downarrow\rangle_x$ remains unaffected. The resulting $^{87}$Rb$^+$ ion and electron are accelerated by an electric field to two channel electron multipliers (CEMs) placed at a distance of 8 mm from the trapping region. The ionization fragments are detected with high efficiencies $\eta_i = 0.90$-$0.94$ (ions) and $\eta_e = 0.75$-$0.90$ (electrons); the efficiencies are slightly different for the two labs and also vary between different measurement runs. We assign detection of at least one of the fragments to the atomic state $|\uparrow\rangle_\gamma$, providing a total detection efficiency of $\geq 0.98$ [36,37], while detection of no fragment is assigned to the state $|\downarrow\rangle_\gamma$. Note that in the event-ready scheme an imperfect detection efficiency only affects the fidelity of the measurement process.
In order to perform a fast selection of the measurement direction we switch on one of two polarized readout laser beams with an acousto-optical modulator (AOM) (Fig. 2). The latency time from the output of the random bit of the QRNG until the readout pulse reaches the atom is 217 (204) ns. To optimize the measurement fidelity we accept ions arriving at the detectors up to 570 (725) ns after the beginning of the ionization process. The different times for the two traps result from different acceleration fields and, consequently, different times of flight of the ions. Together with the avalanche transition time within the CEMs and the latency of the processing electronics of 80 (84) ns, the total time until the result appears as a digital pulse at the output is 947 (1093) ns after the starting time of the measurement. We consider this signal to be perfectly clonable and, thus, to represent a definite classical entity with a value existing independent of observation. It is recorded together with the respective random bit (at trap 1 also with the result of the BSM) in a local storage unit.
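The timing budget can be verified by adding the quoted contributions and comparing them with the light travel time over 398 m. The following sketch is a simplified consistency check only: it ignores the small offset between the measurement start times in the two labs and the exact geometry between QRNGs and traps, both of which enter the full space-like separation analysis.

```python
C_M_PER_S = 299_792_458.0    # speed of light in vacuum
DISTANCE_M = 398.0           # separation of the two traps (quoted)

light_time_ns = DISTANCE_M / C_M_PER_S * 1e9

# Quoted contributions in ns:
# (random-bit age, QRNG-to-atom latency, ionization + ion flight, CEM + electronics)
budget_ns = {
    "trap 1": (80, 217, 570, 80),
    "trap 2": (80, 204, 725, 84),
}

totals_ns = {trap: sum(parts) for trap, parts in budget_ns.items()}

print(f"light travel time over {DISTANCE_M:.0f} m: {light_time_ns:.1f} ns")
for trap, total in totals_ns.items():
    slack = light_time_ns - total
    print(f"{trap}: measurement takes {total} ns, slack {slack:.1f} ns")
```

Both totals reproduce the quoted 947 ns and 1093 ns and stay below the 1328 ns available for space-like separation.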
We performed several measurement runs in the time period between November 2015 and June 2016. After a first clear violation with 300 events was observed on Nov. 27, 2015 (see [24] Sec. VI.A), the stability of the setup was improved, allowing for long-term measurements. For testing the hypothesis that our experimental results can be described by an LHV theory, a well-defined experimental procedure was established to avoid expectation bias [38]. For that purpose all relevant details were fixed before the start of each run. These include the number of events to be collected, the analysis procedure, as well as scheduled maintenance to be performed; see the Supplemental Material [24] Sec. IV. We chose 5000 events for each prepared atomic state to achieve an appropriate level of significance, evaluation according to Eq. (1), and maintenance every 24 hours. We present two runs fulfilling these criteria in the following.
For the measurement run started on Apr. 15, 2016 the obtained correlations are shown in Fig. 3. For the 5000 events for each of the two atom-atom states collected during 4 days, the resulting S-parameters of $2.240 \pm 0.047$ ($|\Psi^-\rangle$) and $2.204 \pm 0.047$ ($|\Psi^+\rangle$) show a violation of the LHV limit by 5.1 and 4.3 standard deviations, respectively. By combining the events for the two atomic states we obtain $S = 2.221 \pm 0.033$, corresponding to a violation by 6.7 standard deviations. In order to determine the impact of these results for ruling out LHV theories we use the null hypothesis that the experiment is governed by LHV. Under this assumption one can estimate the probability of obtaining a certain violation of Bell's inequality or a more extreme one, which is called the P-value. Within the hypothesis one can also allow for potential memory effects [39], where the history of the experiment may influence the probabilities of outcomes. We use two different models for calculating upper bounds for the P-value: the martingale approach [40] ($P_m$) and the game formalism [41] ($P_g$); for details see [24] Sec. III. For the combined data of the measurement above we obtain $P_m = 2.57\cdot10^{-9}$ and $P_g = 1.74\cdot10^{-10}$.
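The quoted numbers of standard deviations follow directly from the measured S-parameters and their uncertainties. As a minimal sketch (the helper name `violation_sigma` is illustrative, and this simple ratio is not the memory-robust P-value analysis of [40,41]):

```python
import math

def violation_sigma(s_value, s_error, lhv_bound=2.0):
    """Distance of the measured S from the LHV bound in units of its uncertainty."""
    return (s_value - lhv_bound) / s_error

runs = {
    "Psi-": (2.240, 0.047),
    "Psi+": (2.204, 0.047),
    "combined": (2.221, 0.033),
}

for label, (s, err) in runs.items():
    print(f"{label}: S = {s:.3f} +/- {err:.3f} -> {violation_sigma(s, err):.1f} sigma")

# Two equally sized, independent samples reduce the uncertainty by sqrt(2).
combined_error = 0.047 / math.sqrt(2)
print(f"expected combined uncertainty: {combined_error:.3f}")
```

The combined uncertainty of 0.033 agrees with what is expected when two samples of equal size and equal uncertainty 0.047 are merged.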
Explicit data for the above run, for the first violation in 2015, as well as for further runs are documented in the Supplemental Material [24] Sec. VI. In particular, we want to point to the run started on June 14, 2016. Its start was made public via the Twitter account @munichbellexp and simultaneously at a conference [42]. The results of each of the events, coming in at a rate of about 1/min, were directly communicated to a central server, http://bellexp.quantum.physik.uni-muenchen.de, which made all the data available together with the momentary evaluation. In this public Bell test, due to the lower rate of trapping single atoms, the 2 × 5000 events were collected during a time of 10 days, resulting in $S = 2.134 \pm 0.048$ ($|\Psi^-\rangle$) and $S = 2.057 \pm 0.048$ ($|\Psi^+\rangle$). The violations of 2.8 and 1.2 standard deviations result in P-values for the combined data of $P_m = 0.0267$ and $P_g = 2.82\cdot10^{-3}$. It should be noted that with the modest event rate the effect of counting statistics on the momentary value of the S-parameter became clearly visible to a wide audience. The complete data are available for download from the server.
Finally, we consider a further frequently mentioned loophole, the free-will (or freedom-of-choice) loophole [43], which targets the independence of the choice of the analysis directions from the hidden variables and vice versa [7]. In contrast to experiments with photon pairs [18,19], event-ready tests using entanglement swapping do not have a typical moment where the LHVs would have been defined [44]. If we assume that the LHVs are defined at the time of the BSM, which in our experiment takes place 10.7 µs before the choice of the local analysis directions, they are clearly not influenced by the latter. Conversely, however, the random settings are determined within the light cone of the BSM, so independence has to be assumed here. This was accounted for in [15,18,19], where generation of the random numbers is considered to be outside of the light cone of the entanglement generation (allowing one to exclude influences within one trial of the experiment up to a few nanoseconds for the photon experiments [18,19] or 690 ns for the experiment using NV centers [15]).
However, in the analysis of all experiments (including the present one) there is still the implicit assumption that the dependence of the random numbers generated for the n-th observation event on processes or events of any kind in their backward light cone is strongly limited [45] (e.g., dependence on previous settings and outcomes of the experiment). Effectively, while one allows memory, and thereby dependence on the history, for the LHV model determining the measurement outcomes, one does not allow memory for the (quantum) systems observed in the QRNGs to determine the settings. To avoid such assumptions and the corresponding loopholes, and to warrant true independence of the random settings also in view of memory attributed to all quantum systems, one should produce random numbers outside the light cones of all other events of the Bell test. Spacelike separated extraterrestrial sources of randomness are required and have to be developed to ensure this [46].
In this Letter we described a highly reliable event-ready Bell test, showing in several attempts a clear violation of a Bell inequality. With violations of more than 6 standard deviations obtained in a run with 10000 events, the probability that this actual result could be described by local hidden variables is at most $P_m = 2.57\cdot10^{-9}$. Taking all data accumulated during a time period of 7 months with over 55000 events (without any postselection) decreases this value to $P_m = 1.02\cdot10^{-16}$. On the fundamental side, further reducing the number of assumptions on the independence of the randomness generation makes the development of methods for employing extraterrestrial sources highly desirable. From the point of view of applications, where the requirements for the random setting choice are different, our essentially loophole-free Bell test forms a promising platform for device-independent secure communication. The methods and results achieved here pave the way for new developments of quantum information and for future quantum repeater networks.
[45] There are well-developed descriptions accounting for a possible dependence of the random settings on the history of the experiment [
[49] The significance level was chosen to be 0.001. Since we take the uniformity of the P-value distribution as a measure of randomness, the significance level is of no particular importance. Still, the number of files which "fail" a certain test, i.e., where the P-value is too close to 0 or 1, is indeed compatible with the significance level, as expected for a uniform distribution.

A. Synchronization and control of the experiment
The experiment consists of two independent atom traps connected by a 700 m long fiber link for photons emitted by trapped atoms as well as for synchronization and communication between the two setups. All communication between the two sides is done optically at telecom wavelength (Fig. S1, fibers 2-6). Depending on the specific requirements the signals are transmitted via analog (a), digital (d) or network (n) electro-optical converters. Each trap is controlled locally by a PC and a custom-built pattern generator that is used as a control unit (CU). The PC next to trap 1 acts as master. It is capable of controlling the remote PC in lab 2 by sending commands via an optical network connection (fiber 5). It also continuously analyzes the photon counts collected from both traps via the four avalanche photo diodes (APDs) of the BSM arrangement. The CUs operate at a clock speed of 50 MHz and thus a resolution of 20 ns with a timing jitter of < 40 ps. They are capable of switching lasers and logic signals in a programmable way and also respond to external signals. All timing-critical components are synchronized to a common 100 MHz clock located in lab 2 whose signal is distributed via fiber 2. Synchronization of the CUs is monitored by a synchronization unit in lab 1 with the help of an additional signal transmitted via fiber 4, guaranteeing altogether a low relative jitter of less than 150 ps rms between the two labs. A time-to-digital converter (TDC) with a time resolution of 80 ps, connected to the master PC, records all necessary time stamps, e.g., all photon counts and time markers generated by the CUs indicating the different sequences in the experiment.
Figure S2. Experimental sequence.

B. Experimental sequence
Loading of the atom traps is controlled by the PCs. Depending on the level of photon counts integrated within 40 ms, it is distinguished whether atoms are present in the two traps, and loading operations are initialized accordingly. Loading of both atom traps typically takes about 2-3 s.
After both traps are loaded the master PC initiates switching of the two CUs to the excitation sequence (Fig. S2). The required signal is generated by the start control unit (SCU) in lab 2 and communicated to lab 1 via fiber 3 (Fig. S1). The excitation sequence consists of preparation of the $5^2S_{1/2}, F = 1, m_F = 0$ state by optical pumping followed by excitation to the $5^2P_{3/2}, F' = 0, m_{F'} = 0$ state. After each excitation there is a waiting time of 7.3 µs needed to transmit the photons from trap 2 to the BSM setup in lab 1 and to transmit a potential two-photon detection signal back to lab 2. This procedure is performed in repeated bursts that are timed such that the emitted photons of both trap setups have a temporal overlap close to unity at the BSM arrangement. A successful BSM is registered by one of four characteristic two-photon detection events ($1_\perp 2_\parallel$ or $1_\parallel 2_\perp$ for $|\Psi^-\rangle$ and $1_\perp 1_\parallel$ or $2_\perp 2_\parallel$ for $|\Psi^+\rangle$) in a time window of 120 ns [13]. If one of those two-photon detections has occurred, signals are sent to both CUs to switch to the state measurement. Otherwise, after 40 preparation-excitation cycles the atoms are recooled for 350 µs and the sequence is restarted, giving an average rate of excitation attempts of $5.2\times10^{4}\,\mathrm{s}^{-1}$. If the master PC registers loss of one of the atoms the corresponding trap is reloaded.
C. Detailed timing scheme of the atomic state measurement
Fig. S3 shows the measured timings of all relevant processes of the atomic state measurement after a two-photon coincidence. The state measurements have to be space-like separated, which requires fast measurements with precisely known timings.
The common start signal for the state measurement is generated in lab 1 by an FPGA registering a two-photon coincidence in the BSM arrangement. This signal is distributed to both control units (CUs), locally via a 50 cm coaxial cable and via the 700 m long optical link for the distant location. The propagation time is measured by the delay of a signal transmitted to the remote location and back. This yields signal transmission times of 2.5 ± 0.2 ns and 3717 ± 7 ns, respectively. Additional waiting times on both sides are necessary to minimize dephasing of the atomic state; they are chosen such that the state measurement is performed after a full transverse oscillation period of the atom in the respective trap. The CU in lab 1 waits 10.74 µs and the CU in lab 2 waits 7.00 µs before a measurement start signal is issued. With this the measurements in the two traps start almost simultaneously (trap 2 starts 28.5 ns earlier).
First, a random bit is requested determining the measurement direction and, in our setup, determining which of the two acousto-optic modulators (AOMs) is activated to switch on the readout laser beam with the respective polarization. The switching times of the AOMs, given by the propagation of the acoustic wave in the crystal, together with propagation delays in cables, electronic components, and the optical path of the readout pulse to the position of the atom are measured with fast photo diodes placed at an equivalent distance to the atom traps. The delays between the output of a random bit and the beginning of the ionization process are 217 ± 4 ns and 204 ± 4 ns. The time of flight of electrons towards the channel electron multipliers (CEMs) was calculated to be 3 ns. Using this, the ionization time and time of flight of the ions to the CEMs are extracted from the arrival time histogram of electrons and ions. With these times, at the beginning of a measurement run we define two fixed time windows for accepting electron and ion detection events, chosen to optimize the signal-to-noise ratio for better fidelity. In the presented measurement runs the lengths of these windows were 240 ns (160 ns) for electrons and 240 ns (220 ns) for ions for trap 1 (trap 2). The maximal time for ionization and flight given by the end of the ion acceptance window is 570 ± 3 ns and 725 ± 3 ns for the two traps, respectively (the uncertainty results from the rising edge in the electron arrival time histogram at the beginning of the process). These are different for the two traps as the acceleration voltages between the CEMs differ to avoid high-voltage breakdowns in trap 2. The measurement is considered finished when the detection clicks of the CEMs are transformed to a logic pulse in electronics outside the vacuum chamber, which happens 80 ns and 84 ns after the detection. Together with the time of the RN generation/correlation of 80 ns this results in the overall measurement time of 947 ± 1 ns and 1093 ± 1 ns, respectively. Note that this time is known with a high precision as it is based on signals of the CU (starting signal of the measurement and the end of the ion acceptance window). The stated uncertainty results mostly from the additional signal processing units (like signal converters, etc.).
Figure S3. Detailed timing scheme of the experimental sequence after successful BSM.

D. Criteria of space-like separation
The distances between the laboratories and in particular between the QRNGs and atom traps were determined by combining measurements within the buildings with maps and building outlines provided by the Bavarian land surveying office  (Fig. S3). By further delaying the measurement at trap 1 the margin could in principle be made symmetric at 304.0 ns.

E. Data recording
For evaluation of the experiment we use two independent ways for data recording. First, for every successful BSM event, the resulting Bell state, the requested random bit and the measurement outcomes (CEM signals) are recorded locally on both sides ("local storage" in Fig. S1). This enables an evaluation of correlations and of Bell's inequality. Second, the TDC records time stamps of all photons registered in the BSM arrangement, as well as of the requested random bits and CEM detection clicks from both sides. Additional signals indicating the position in the experimental sequence are stored in this data stream enabling a complete analysis of the experiment.

II. GENERATION OF RANDOM SETTINGS
For performing a test of Bell's inequality free of the locality loophole the choices of the measurement directions, in the ideal case, are perfectly unpredictable. In our experiment these choices are derived from the outputs of quantum random number generators (QRNGs) which are unpredictable according to the physical model of the QRNGs. Technical imperfections can lead to a residual predictability of the output bit sequences which has to be taken into account in the analysis of the experimental results (Sec. III). In particular, to derive the P-values one requires the maximal deviation $\tau$ from perfect unpredictability, which is defined such that for all output bits $q_i$: $\frac{1}{2} - \tau \le \Pr(q_i = 0) \le \frac{1}{2} + \tau$. In the following we describe the function of the employed QRNGs and estimate their predictability from a model. Furthermore we perform an evaluation of bias, serial correlations and general statistical tests. Although statistical testing can never certify real randomness, it is the method of choice for testing the hypothesis of the bit sequence being random. Moreover, it still gives some information about the quality of randomness of the bits obtained from the QRNGs and allows identifying potential artifacts.
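To illustrate what a bias evaluation involves, the sketch below estimates the deviation of $\Pr(q_i = 0)$ from 1/2 for a recorded bit sequence and checks whether it is statistically compatible with a given $\tau$. This is a hypothetical helper, not the analysis used for the experiment; as stated above, such a test can only fail to reject a bound and can never certify it.

```python
import math

def estimate_bias(bits):
    """Return (bias, standard error), where bias = fraction of zeros - 1/2.
    The standard error uses the binomial value 0.5/sqrt(n) valid near p = 1/2."""
    n = len(bits)
    if n == 0:
        raise ValueError("empty bit sequence")
    p_zero = bits.count(0) / n
    return p_zero - 0.5, 0.5 / math.sqrt(n)

def compatible_with_tau(bits, tau, n_sigma=3.0):
    """True if the observed bias lies within tau plus n_sigma standard errors."""
    bias, stderr = estimate_bias(bits)
    return abs(bias) <= tau + n_sigma * stderr

balanced = [0, 1] * 500          # toy data with exactly zero bias
stuck = [0] * 1000               # toy data from a generator stuck at 0

print(estimate_bias(balanced))
print(compatible_with_tau(balanced, tau=1e-5))
print(compatible_with_tau(stuck, tau=1e-5))
```

Since the standard error scales as $0.5/\sqrt{n}$, resolving a bias of order $10^{-5}$ requires on the order of $2.5\times10^{9}$ bits or more.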

A. Random number generators
The method used for the random number generator (QRNG) [34] is based on counting the number of photons emerging from a light emitting diode (LED) source, passing an attenuator and detected by a photo-multiplier tube (PMT). The analog pulses at the output of the PMT are digitized by a comparator and counted within time bins of 20 ns. The parity of the registered photon number finally constitutes the random bit.
The physical model employs the fact that according to photodetection theory [25,26] the detection events from a broadband light source of constant power are fully uncorrelated on the timescale of the counting interval. In particular, each photon is assumed to be registered by the detector at an unpredictable time which is independent of any previous events in the backward light cone as well as of the LHVs in the experiment. This leads to a Poissonian distribution of the detected photon numbers at the PMT. In our implementation the extendable dead time of the detector, i.e., its inability to register pulses arriving in short succession, which are interpreted as one long pulse, modifies this distribution ([34], Fig. 2). This acts like real-time hardware processing and additionally allows bias of the output bits to be avoided without the need for any further post-processing [34]. Here we also assume that the detector itself cannot be influenced within the LHV model.
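For orientation, the parity statistics of an ideal Poissonian count can be worked out explicitly: for mean photon number $\mu$ per time bin, $\Pr(\text{even}) = e^{-\mu}\cosh\mu = \frac{1}{2}(1 + e^{-2\mu})$. The sketch below checks this closed form against a direct summation. It deliberately assumes a pure Poisson distribution and does not model the extendable dead time, which according to the text modifies the distribution and is used to remove the bias.

```python
import math

def p_even_closed_form(mu):
    """Probability of an even count for an ideal Poisson distribution with mean mu."""
    return 0.5 * (1.0 + math.exp(-2.0 * mu))

def p_even_direct_sum(mu, k_max=200):
    """Same probability, summing the Poisson terms with even k up to k_max."""
    term = math.exp(-mu)      # Poisson probability for k = 0
    total = 0.0
    for k in range(k_max):
        if k % 2 == 0:
            total += term
        term *= mu / (k + 1)  # recurrence P(k+1) = P(k) * mu / (k+1)
    return total

for mu in (0.5, 2.0, 5.0):
    bias = p_even_closed_form(mu) - 0.5
    print(f"mu = {mu}: P(even) = {p_even_closed_form(mu):.6f}, parity bias = {bias:.2e}")
```

In this idealized model the parity bias $\frac{1}{2}e^{-2\mu}$ decreases exponentially with the mean count per bin but never vanishes exactly.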
In the experiment the QRNGs are operated at a speed of 50 Mbps and provide the last generated random bit on request within 8 ns. Including all hardware latencies, the maximal "age" of this bit is 60 ns since the emission of the detected, contributing photon(s). The generators exhibit a modest next neighbor correlation of typically $< 1.5\cdot10^{-5}$ while all evaluated correlations with higher lag were found to be compatible with zero (see below). We thus include 20 ns for generation of the previous random bit and consider the output to be at most 80 ns "old". All bits generated during an experimental run were continuously recorded for analysis purposes (organized in files of 1 Gb) [48]. The QRNGs incorporate continuous stabilization and monitoring of temperature and count rate to allow for long-term operation.

B. Estimation of predictability
While the emission and detection of photons are intrinsically random fundamental processes at the current state of knowledge, there are further technical parameters which may affect the output bit. Depending on the model, such parameters may be accessible when the outputs A, B are generated by the observers and thus increase the predictability. The crucial parameters of the generators used are the average photon count rate, the threshold level of the comparator digitizing the PMT output pulses (influencing the extendable dead time [34]) and the temperature of the QRNG devices.

Photon count rate
For a given threshold of the PMT pulse comparator, the bias of the QRNG is a function of the photon count rate. Thus knowledge of the count rate allows a certain predictability. Before each measurement run the QRNGs are operated for a longer period to determine the count rate for minimal bias. During the measurement run the count rate is stabilized at this value by a feedback loop controlling the LED current. Once fixed, the count rate shows only the expected statistical fluctuations. While this already shows the high stability of the involved components, we have additionally characterized the sensitivity of the bias to the LED current, leading to effects much lower than the error made when determining the optimal count rate for minimal bias. This leads to a residual bias of typically between $10^{-6}$ and $10^{-5}$ (see Tab. S1).
Figure caption (fragment): ... from both sides, yielding the 3σ and 5σ intervals of amplitude fluctuations. This allows us to determine the maximal predictability for the corresponding confidence levels.

Threshold level of the comparator
Any fluctuations of the threshold level or, complementarily, any noise on the signal from the PMT will lead to a time-dependent bias and predictability. We thus analyze the bias for different threshold levels at a constant LED current. The result is shown in Fig. S4 (red crosses). In a second measurement we have determined the distribution of registered counts with the LED switched off. Here we can observe two regions: one dominated by the dark counts of the PMT ($< -5$ mV) and one dominated by the electrical noise (Fig. S4, inset). For further analysis we assume that the measured data is an integrated histogram of the noise amplitude distribution. We thus approximate the rising and the falling slopes of the electrical noise part (Fig. S4, blue stars), which is centered around the threshold set voltage, by error functions, obtaining the mean value $\mu_1 = -9.09$ mV and standard deviation $\sigma_1 = 0.13$ mV on the rising side and $\mu_2 = -8.48$ mV and standard deviation $\sigma_2 = 0.25$ mV on the falling side. Since the electrical noise adds to the real PMT pulses, it can be considered equivalent to noise of the comparator threshold level. Thus the influence of this noise on the predictability (bias) can be directly derived from these numbers. For a "paranoid" model which grants the LHV-controlled observers full knowledge about the noise of the QRNG in the distant lab we can now consider either the average over the time-dependent predictability or merely the maximal predictability (within a 5σ confidence interval). In the latter, most paranoid, case we arrive at a maximal additional predictability of $\tau = 6.12\cdot10^{-4}$.
Table S1. Bias for various measurement runs. σ corresponds to the standard deviation in a 360 Gb bin.

Temperature of the QRNG device
The temperature of the critical part of the QRNG including LED, PMT and the comparator is actively stabilized to better than ±0.15 °C. Still, the residual fluctuations, which may be accessible, can influence the critical threshold level of the comparator. Here the specified temperature coefficients of the comparator offset voltage and of the DAC providing the threshold level are both $10^{-5}$ V/°C. This gives, together with the bias dependence from above (Fig. S4), an additional predictability of $\tau = 6.7\cdot10^{-6}$ (in the case the effects add up).

Resulting predictabilities
Let us distinguish two models:
• a (reasonable) model where the information on the internal parameters of the QRNG is inaccessible except for temperature and the count rate (which are both stabilized and monitored with information being available externally). The resulting deviations of the generated bit sequence from an ideal random one determine the bias of the sequence. The maximal observed bias over all measurements (Tab. S1) is $8.74\cdot10^{-6}$, compatible with the expectation. Taking this value and additionally adding a 2σ margin to it we can conservatively estimate a deviation from perfect unpredictability of $\tau_1 = 1.04\cdot10^{-5}$.
• a (very paranoid) model where also the full information on the internal noise at the comparator is known and can be used when determining the measurement result according to such LHV models. This allows for an additional predictability which might not be visible in any typical statistical test. Note that exploiting this information for the distant lab requires extrapolation of the noise behavior for 1.3 µs into the future. With this ability granted, we sum over all above effects (error in the setting of photon count rate, noise on the threshold level, temperature dependence) arriving at a $\tau_2 < 6.3\cdot10^{-4}$. We take this value for calculation of all P-values allowing us to exclude even such models.
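The 2σ margin in the reasonable model can be checked with a short calculation. The sketch below assumes that σ is the purely statistical (binomial) standard deviation of the bias in a 360 Gb bin and that 1 Gb corresponds to $10^9$ bits; both are assumptions made here for illustration, and the variable names are hypothetical.

```python
import math

GBIT = 10**9  # assumption: 1 Gb = 1e9 bits

def binomial_bias_sigma(n_bits: int) -> float:
    """Standard deviation of the bias n0/n - 1/2 for n ideal, independent bits."""
    return 0.5 / math.sqrt(n_bits)

max_observed_bias = 8.74e-6                  # maximal observed bias quoted in the text
sigma_bin = binomial_bias_sigma(360 * GBIT)  # statistical sigma of one 360 Gb bin (assumption)
tau_1 = max_observed_bias + 2.0 * sigma_bin  # observed bias plus a 2-sigma margin

print(f"sigma = {sigma_bin:.3e}, tau_1 = {tau_1:.3e}")
```

Under these assumptions σ is about $8.3\cdot10^{-7}$ and the sum is about $1.04\cdot10^{-5}$, consistent with the quoted $\tau_1$.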
We note that the predictabilities, even in the paranoid models, can be significantly reduced by performing an XOR operation on several successive output bits, thereby combining them into one bit. This operation can be efficiently performed in hardware. In our experiment the time budget allows for combining at least 13 bits, while $\tau$ would be $< 10^{-6}$ already for a combination depth of 2 bits.
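The effect of XOR combination can be estimated with the piling-up lemma. The sketch below assumes that the combined bits are statistically independent, which neglects the small next-neighbor correlation reported above; the function name is illustrative.

```python
def xor_deviation(tau: float, depth: int) -> float:
    """Maximal deviation from 1/2 of the XOR of `depth` independent bits,
    each satisfying |Pr(bit = 0) - 1/2| <= tau (piling-up lemma)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return 2.0 ** (depth - 1) * tau ** depth

tau_2 = 6.3e-4  # paranoid-model predictability quoted in the text
for depth in (1, 2, 3):
    print(depth, xor_deviation(tau_2, depth))
```

For a depth of 2 this gives $2\tau_2^2 \approx 7.9\cdot10^{-7}$, in line with the statement that $\tau$ drops below $10^{-6}$.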

C. Bias
We have evaluated the bias $B = \frac{n_0}{n} - \frac{1}{2}$ ($n_0$ being the number of ones in a sample of $n$ bits) of the generated bit sequences. Figure S5 shows an example of the observed bias as a function of time for one of the measurement runs. The bin size is chosen large enough for the statistical noise to be smaller than the observed value, which still allows observing possible drifts on a long time scale. Table S1 gives an overview of the data and the maximal observed bias for different measurement runs.
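A minimal sketch of such a binned bias evaluation is given below. It follows the definition of $B$ above; the helper names and the tiny example sequence are hypothetical, and real data would of course use far larger bins.

```python
from typing import List, Sequence

def bias(bits: Sequence[int]) -> float:
    """Bias B = n_ones / n - 1/2 of a sequence of 0/1 values."""
    if not bits:
        raise ValueError("empty sample")
    return sum(bits) / len(bits) - 0.5

def binned_bias(bits: Sequence[int], bin_size: int) -> List[float]:
    """Bias of consecutive, non-overlapping bins; an incomplete last bin is dropped."""
    n_bins = len(bits) // bin_size
    return [bias(bits[k * bin_size:(k + 1) * bin_size]) for k in range(n_bins)]

example_bits = [1, 1, 0, 0, 1, 0]  # hypothetical toy data
print(binned_bias(example_bits, 2))
```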

D. Correlations
For each measurement and both random number generators we analyzed the data of every file for serial correlations (or autocorrelation) according to the serial correlation coefficient $\mathrm{SCC}_l$, where $q_i \in \{0, 1\}$ are the bits and $l$ is the lag (we evaluated correlations up to a lag of 56). This formula is not corrected for bias, which in our case would lead only to negligible corrections. All correlations with lag $l > 1$ were found to be consistent with zero. Tab. S2 shows, as an example, the correlations for one of the measurement runs. We also evaluated the time evolution of $\mathrm{SCC}_1$; only a small variation could be observed within ∼230 h of measurement time, see Fig. S6.
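As an illustration of a serial correlation analysis, the sketch below uses one common estimator that is not corrected for bias: the mean product of the bits mapped to ±1 at a given lag. This particular form is an assumption made here and may differ from the authors' formula in normalization; the function name is hypothetical.

```python
from typing import Sequence

def serial_correlation(bits: Sequence[int], lag: int) -> float:
    """Mean of (2*q_i - 1) * (2*q_{i+lag} - 1) over the sequence (no bias correction)."""
    if not 0 < lag < len(bits):
        raise ValueError("lag must satisfy 0 < lag < len(bits)")
    n_pairs = len(bits) - lag
    total = sum((2 * bits[i] - 1) * (2 * bits[i + lag] - 1) for i in range(n_pairs))
    return total / n_pairs
```

For an ideal random sequence of $n$ bits this estimator fluctuates around zero with a standard deviation of about $1/\sqrt{n}$, which is why correlations at the $10^{-5}$ level only become visible with very large data sets.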

E. Statistical tests
We tested the output bit sequences of the QRNGs using the statistical test suite "TestU01 Alphabit battery" [27]. We applied these tests to all bits collected during the measurement runs on Apr. 15, 2016 and June 14, 2016, which are 115420 files (115 Tb in total). For each data file, all statistical tests were applied [49] and the resulting P-values for the null hypothesis of randomness (iid bits with $\Pr(q_i = 1) = \Pr(q_i = 0) = \frac{1}{2}$) were calculated. In the ideal case the P-values are expected to be uniformly distributed. This was observed for all of our data and all tests in the battery except for the test row smultin_MultinomialBitsOver (test on uniformity of appearance of bit chains of certain length, evaluated using the overlapping serial approach), see Table S3. There the P-value distribution is shifted towards 0. To understand this behavior we applied a related test, smultin_MultinomialBits (the same as above but evaluated using the non-overlapping serial approach), which can be modeled more easily using a noncentral $\chi^2$-distribution. Fig. S7 shows that our model, which includes only the known bias and next-neighbor correlation, fits the data well. Thus the applied set of tests did not reveal any effects in the data which are stronger than the next-neighbor correlation. Note that these subtle effects are only visible due to the large amount of data available. Altogether our findings support the thesis that the bits are in fact random and well-suited for our application.
Table S3. Results of the TestU01 Alphabit test battery for the measurement runs on April 15, 2016 and June 14, 2016. On each side the tests were applied to 16072 and 41638 files of 1 Gb for the two runs, respectively. The resulting distributions of P-values for each test were checked for uniformity with a $\chi^2$-test, whose P-values are shown here.

III. TESTING LHV THEORIES
The main goal of Bell experiments is to rule out all local-hidden-variable (LHV) theories by measuring a violation of Bell's inequality. Since real experiments can only generate a finite amount of data, even an experiment governed by LHVs may still exhibit a violation of Bell's inequality by chance. To account for this, one estimates the probability that a specific violation or a more extreme one can be produced by an experiment governed by LHVs. If this probability is small for an experimental outcome it is fair to reject the hypothesis of LHVs with a certain confidence. This procedure is called a null hypothesis test and the respective probability is called a P-value. For a null hypothesis test of the validity of LHV theories we have to find the probability distribution for the S-values under the assumption of LHVs. With this probability distribution we can calculate the P-value for any measured value of S.
When performing such an analysis, one needs to be careful not to introduce additional assumptions into the model to be tested. The standard way of evaluating experimental data would be to assume a Gaussian distribution of measurement results. Then the P-value can be easily calculated from the measured value of S and its standard deviation. However, this requires the assumption that the experimental tries can be considered independent and identically distributed (iid), which is not necessarily valid. First, the experimental parameters may vary over the course of the experiment. Second, from the fundamental point of view, the knowledge of the history of previous settings and outcomes might allow for an LHV model violating Bell's inequality (memory loophole). Depending on the formulation of the inequality this is indeed possible [39], although the violation approaches zero with increasing number of tries N. In any case an analysis procedure which does not require the assumption of iid is needed. Ways to achieve this include modeling of the underlying stochastic process as a martingale [40] or its formulation as a game [41], as well as the prediction-based-ratio (PBR) approach [28].
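For comparison, the Gaussian evaluation mentioned above can be sketched as follows. It is shown only to make the iid-based calculation concrete and is not the analysis used for the reported P-values, since it relies on exactly the assumptions criticized in the text. The function name is illustrative; the example values of S and its standard deviation are the ones quoted in the abstract.

```python
import math

def gaussian_p_value(s_measured: float, sigma: float, s_lhv_bound: float = 2.0) -> float:
    """One-sided P-value for observing s_measured or more, assuming a Gaussian
    distribution with standard deviation sigma centered at the LHV bound."""
    z = (s_measured - s_lhv_bound) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Values quoted in the abstract: S = 2.221, standard deviation 0.033
print(gaussian_p_value(2.221, 0.033))
```

Because this estimate presumes iid trials, it is generally more optimistic than bounds derived without that assumption.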

A. Defining the null hypothesis
The first step is to formulate the null hypothesis in a way that we can use to calculate bounds on the P-value. For this we use the CHSH inequality as a mathematical formulation of the hypothesis and define it for our experiment.
Our Bell experiment employs two observers in lab 1 and lab 2. On receiving the heralding signal each side of the experiment gets an input for the setting selection and produces a measurement outcome (in the following we call such a process "event"). We name the inputs for the $i$-th of $N$ events $a_i \in \{\alpha, \alpha'\}$ for lab 1 and $b_i \in \{\beta, \beta'\}$ for lab 2. Similarly we name the measurement outcomes for this event $x_i$ for lab 1 and $y_i$ for lab 2 with $x_i, y_i \in \{-1, 1\}$ (outcome $|\uparrow\rangle$ corresponding to 1, outcome $|\downarrow\rangle$ to −1). Since our experiment employs an event-ready scheme we name the event-ready signal for this event $h_i$, where $h_i = 1$ heralds the $\Psi^+$-state and $h_i = -1$ heralds the $\Psi^-$-state.
We define the functions $g_\pm(a, b)$ for events with the event-ready signal $h_i$ for $\Psi^+$ or $\Psi^-$ that take the values 1 or −1: With these functions we can write the CHSH inequality for each state in the form: Here $C^{N_\pm}_{a,b} = N^{\uparrow\uparrow}_{a,b} + N^{\downarrow\downarrow}_{a,b}$ is the number of correlated measurement outcomes ($x_i = y_i$) and $A^{N_\pm}_{a,b} = N^{\uparrow\downarrow}_{a,b} + N^{\downarrow\uparrow}_{a,b}$ is the number of anticorrelated measurement outcomes ($x_i \neq y_i$), for state $\Psi^\pm$ respectively.
For large $N$ and low bias of the input random bits we can approximate $C^{N_\pm}_{a,b} + A^{N_\pm}_{a,b}$ with $N_\pm/4$ and obtain This form of Bell's inequality is not susceptible to systematic violations exploiting finite statistics shown in [39].
In order to estimate the P-value one can employ concentration inequalities which provide bounds on deviations of stochastic processes from their expectation values. Here we will use the McDiarmid inequality [29] in the following way, similar to [28]. First we define the sequences We can now write equation (S3) as is a measure for the violation of Bell's inequality. For any experiment governed by LHVs saturating Bell's inequality, $Z_i$ is a martingale difference sequence and we can use the equation (6.1) from [29] with $t = \frac{S_\pm - 2}{8}$, $A = \frac{3}{4}$, and $\bar{A} = \frac{1}{4}$ to bound the probability of a certain violation $\delta$ or a more extreme one under assumption of LHVs, thereby obtaining the P-value $P_m$: This bound is also valid for all LHV models which do not saturate Bell's inequality, in which case the process is a supermartingale.
The limit of the S-value achievable with LHVs increases if the random bits are not ideal (partially predictable). The deviations from perfect unpredictability $\tau_a, \tau_b \in \left[-\frac{1}{2}, \frac{1}{2}\right]$ on the two sides are defined $\forall i$ by where $H_i$ is the common history of the experiment at the $i$-th event, i.e. the complete information available to the two observers which can be used for predicting the setting choice on the other side. For simplicity we set $\tau = \max(|\tau_a|, |\tau_b|)$. We note that the predictability may depend on the history of the experiment $H_i$.
To calculate the effect of the predictability on the S-parameter we consider the following. For example let us assume that for a certain attempt $i$ where the atomic state $|\Psi^+\rangle$ was prepared, the probabilities of the setting inputs are $\Pr(a_i = \alpha) = \frac{1}{2} + \tau$ and $\Pr(b_i = \beta) = \frac{1}{2} + \tau$. This means that in this situation the probability of the $(\alpha, \beta)$ input is the highest and of the $(\alpha', \beta')$ input is the lowest. An LHV strategy (of the 16 possible) which would maximize the expectation value of S for this attempt would be to produce anticorrelations for all input combinations.
Then one obtains for this expectation value As is easy to verify, for any combination of input setting probabilities and each atomic state there exists an LHV strategy achieving this value. Thus, by optimally switching strategies event for event, one obtains the expectation value We consequently use t = S ± −2 To calculate the P-value for the combined data of $\Psi^+$- and $\Psi^-$-states we use (S11)

C. Game formalism
An alternative approach is to formulate a CHSH experiment as a game [41]. Here, LHVs are represented by two parties which have to generate correlations or anticorrelations based on random inputs, where only the local input is known to each party during a game round. Otherwise they may employ any strategy, which may also be adapted during the course of the game, and are allowed to communicate between the rounds. To win the game they need to produce the right correlations (three anticorrelations, one correlation, depending on the input) a maximal number of times.
To describe this formally, we define the functions $w^+_i$ for $\Psi^+$-events and $w^-_i$ for $\Psi^-$-events: (S12) If $w^\pm_i = 1$ the game is won for this round. Thus the total number of rounds won for each state is $\sum_i \frac{1 \pm h_i}{2}\, w^\pm_i$. For LHV theories the probability of winning a single round of a CHSH-game is [41] Note that, if $\tau$ depends on the history of the experiment (Eq. (S10)), the winning probability will also depend on this history. Now we can calculate the probability $\Pr(W, N)$ of winning at least $W$ times in $N$ rounds: With the number of wins $W_\pm$ and number of events $N_\pm$ for each atomic state we can calculate a P-value $P_g$ for each state individually. To calculate a combined P-value for the complete experiment we define the function for a win for $\Psi^+$ and $\Psi^-$ events The total number of wins for $\Psi^+$ and $\Psi^-$ is $W = W_+ + W_-$ and can be put in Eq. (S14).
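The probability of winning at least $W$ of $N$ rounds can be sketched as a binomial tail. The code below assumes that the single-round winning probability is bounded by the same constant `p_win` in every round; for ideal random inputs the LHV bound of the CHSH game is 3/4, while for partially predictable inputs the corresponding bound from [41] would have to be supplied by the caller. The function names and the example numbers are hypothetical and are not data from the experiment.

```python
import math

def log_binomial_pmf(k: int, n: int, p: float) -> float:
    """Logarithm of the binomial probability of exactly k wins in n rounds."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def prob_at_least(wins: int, rounds: int, p_win: float) -> float:
    """Probability of winning at least `wins` out of `rounds` independent rounds,
    each won with probability p_win (binomial tail, summed in log space)."""
    if not 0.0 < p_win < 1.0:
        raise ValueError("p_win must lie strictly between 0 and 1")
    if wins <= 0:
        return 1.0
    if wins > rounds:
        return 0.0
    logs = [log_binomial_pmf(k, rounds, p_win) for k in range(wins, rounds + 1)]
    largest = max(logs)
    return min(1.0, math.exp(largest) * sum(math.exp(x - largest) for x in logs))

# Hypothetical example: 7800 wins in 10000 rounds against the ideal LHV bound 3/4
print(prob_at_least(7800, 10000, 0.75))
```

Summing in log space avoids underflow of the individual terms for the event numbers typical of such runs.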

IV. AVOIDING EXPECTATION BIAS
An important property of any scientific experiment should be impartiality. This implies that the assessment of the results has to be based on objective criteria only, devoid of any expectation on the outcome. If no special care is taken about this, results can easily become biased towards the expectation (see, e.g., [38]). This can happen, e.g., by conscious or unconscious discarding of data which apparently do not fit the expected value. Publishing predominantly positive results can lead to a distorted picture in the literature, known as the "publication bias". Vice versa, distorted values in the literature may influence new experiments.
The number of parameters in a complex experiment can be large making it difficult to define a complete set of objective criteria for a decision whether a certain experimental run is valid or not. However, one must not decide on its validity by looking at the result. This also prohibits discarding parts of the data where the result deviates from the expectation or stopping the run prematurely when the result appears acceptable. Doing so will lead to apparently "better" results ("P-value hacking").
To account for this problem we defined a list of rules before a run is started (we admit that completely avoiding bias is extremely difficult and our measures might not be complete):
• The number of events to be accumulated is fixed beforehand.
• The acquisition procedure is fixed beforehand. This includes all acceptance time-windows (two-photon coincidence, CEM detections).
• The analysis procedure is fixed beforehand. This includes the calculation of S and P-values.
• Exclusion of events during the experimental run is based only on the two following criteria:
- laser stability: on each side all stabilized lasers are fed into a scanning Fabry-Perot resonator whose output is measured with a photodiode. The resulting spectrum is displayed on an oscilloscope and monitored by a camera. This allows us to (manually) determine the time when a malfunction appeared and to exclude all events between this time and the time the problem is fixed. Malfunctions of most lasers will yield no events (as no atoms will be loaded); however, problems with the readout laser reduce the fidelity and such events have to be excluded.
- CEMs: a high-voltage breakthrough in the detector system can lead to a shutdown. In this case all events are automatically excluded until the problem is fixed.
• The maintenance procedure is performed every 24 hours and is limited to: check of the laser system (frequency stabilization and optical power at all relevant positions), compensation of magnetic fields (for all 3 axes with a precision of 0.5 mG), minimization of polarization rotation in the fibers of the beam splitter in the BSM arrangement including the 5 m fiber connecting trap 1 to the BSM. Together with the automatic polarization compensation procedure of the 700 m fiber from trap 2 this ensures that there is no polarization rotation between different inputs and outputs of the BS in the two-photon interference process.

V. PUBLIC MEASUREMENT RUN
On top of obtaining a conclusive violation of Bell's inequality, an additional goal of our project was to perform an open scientific experiment. This includes defining the rules in advance to avoid expectation bias, as well as making all data available to the public during the whole course of the experiment. For this purpose we have set up a web server http://bellexp.quantum.physik.uni-muenchen.de. All incoming data and other relevant information is presented there in real-time. Important information concerning the measurement is logged and distributed via the Twitter account @munichbellexp.
The public run was started on June 14, 2016 with the goal to collect 5000 events for each of the two prepared atomic states. The run took 10 days including daily maintenance stops. While the obtained violation (Tab. S10) is weaker than in other runs, it is still significant. Note that a measurement run yielding results below average is to be expected, in the same way that results above the average can be obtained due to purely statistical effects.
Table S13. Evaluation of combined S- and P-values for the complete dataset.

VII. INDEPENDENCE OF RANDOM BITS AND NO-SIGNALING
The space-like separation of the measurements in the experiment also implies independence of random bits generated in the two labs. Furthermore there should be no correlations between local outcomes and distant measurement settings, as this would require superluminal communication (signaling). As was pointed out in [30,31], this should be tested as it would (at least) indicate experimental problems possibly invalidating the Bell test. We thus check our experimental data for correlations between random bits from lab 1 and lab 2 and whether the random input at lab 1 is correlated with the measurement outcome of lab 2 or vice versa.
Table S14. Random bits used for selection of measurement settings in presented measurement runs.
For independent random bits neither the probability for b = 1 or b = 0 should depend on the space-like separated selection of random variable a nor the probability of a on b. To test this hypothesis we perform a two-proportion z-test, since the distribution of the random bits is binomial and thus for large N approaches Gaussian with a known standard deviation. The calculated P-values for our data give no reason to discard the null-hypothesis of independent random bits.
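A two-proportion z-test of this kind can be sketched as follows. The sketch uses the pooled-proportion form with a two-sided Gaussian P-value; whether the authors used exactly this variant is an assumption, and the function name, argument names and example counts are hypothetical.

```python
import math
from typing import Tuple

def two_proportion_z_test(k1: int, n1: int, k2: int, n2: int) -> Tuple[float, float]:
    """Two-sided two-proportion z-test with pooled variance.

    k1 of n1 events show b = 1 when a = 0, and k2 of n2 events show b = 1 when a = 1.
    Returns the z statistic and the two-sided P-value under the Gaussian approximation.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical counts, not data from the experiment
print(two_proportion_z_test(2510, 5000, 2475, 5000))
```

A large P-value means the observed difference between the two proportions is compatible with independent random bits.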

B. No-signaling
Next, we check if the local measurements are correlated with the random inputs on the other side of the experiment. Since we have no sufficiently exact prediction of the outcome probabilities (due to the details of the measurement procedure and available statistics), we employ a two-sample t-test to check the null hypothesis that the measurement outcomes do not depend on the random input of the other side. The distribution of the calculated P-values (Table S15)
Table S15. Data for testing no-signaling.