Improved Sterile Neutrino Constraints from the STEREO Experiment with 179 Days of Reactor-On Data

The STEREO experiment is a very short baseline reactor antineutrino experiment. It is designed to test the hypothesis of light sterile neutrinos being the cause of a deficit of the observed antineutrino interaction rate at short baselines with respect to the predicted rate, known as the Reactor Antineutrino Anomaly. The STEREO experiment measures the antineutrino energy spectrum in six identical detector cells covering baselines between 9 and 11 m from the compact core of the ILL research reactor. In this article, results from 179 days of reactor turned on and 235 days of reactor turned off are reported in unprecedented detail. The current results include improvements in the description of the optical model of the detector, the gamma-cascade after neutron captures by gadolinium, the treatment of backgrounds, and the statistical method of the oscillation analysis. Using a direct comparison between antineutrino interaction rates of all cells, independent of any flux prediction, we find the data compatible with the null oscillation hypothesis. The best-fit point of the Reactor Antineutrino Anomaly is rejected at more than 99.9% C.L.


I. INTRODUCTION
Over the last two decades, measurements of neutrino oscillations in the three flavour framework have determined all mixing angles and mass splittings [1]. However, a recent re-evaluation of the prediction of antineutrino spectra emitted by nuclear reactor cores revealed a 6.5 % deficit between detected and expected fluxes at baselines less than 100 m [2]. This observed discrepancy, known as Reactor Antineutrino Anomaly (RAA), with a significance of 2.7 standard deviations triggered a new set of reactor antineutrino experiments at very short baselines of about 10 m including the Stereo experiment.
There are basically two possible explanations for the RAA. One is given by beyond-standard-model physics when assuming the existence of a sterile, i.e. non-weakly interacting, additional neutrino. If the mass splitting with respect to the known three active neutrino masseigenstates is around 1 eV, significant oscillations towards this new neutrino eigenstate may be visible at baselines smaller than 100 m. As a result, a deficit in the detected absolute antineutrino flux is observable at those baselines. Such a sterile neutrino could also explain the lower than predicted electron neutrino rates measured in the calibration runs of the gallium solar neutrino experiments [3,4]. However, the antineutrino flux model relies on single measurements of beta-spectra of thin foils of uranium and plutonium irradiated by thermal neutrons [5,6]. From those spectra, a non-trivial conversion to antineutrino spectra, yields the predicted antineutrino flux [7,8]. Erroneous steps in this procedure or the underestimation of its systematic uncertainties could also potentially explain the RAA or diminish its significance.
This article is concentrating on an approach to test the sterile neutrino hypothesis independent of rate or spectral shape predictions as detailed in the following: The survival probability of initial antineutrinos at baselines below 100 m can be approximated in natural units as P νe→νe (L, E) = 1 − sin 2 (2θ ee ) sin 2 ∆m 2 41 L

4E
(1) where L and E are the baseline and antineutrino energy, respectively, θ ee denotes the mixing angle, and ∆m 2 41 = m 2 4 − m 2 1 denotes the difference of the squared eigenvalues of the new mass-eigenstate m 4 and the first mass eigenstate m 1 . The original RAA is best explained by sterile neutrino oscillations when assuming sin 2 (2θ ee ) = 0.14, ∆m 2 41 = 2.4 eV 2 [2]. This point in the parameter space serves as benchmark in this article. Since the survival probability is dependent on the baseline L and the antineutrino energy E, the existence of a sterile neutrino state manifests in a baseline and energy dependent distortion of the measured spectra. Thus, by segmenting a detector, a prediction-independent relative measurement can be performed.
The running very short baseline experiments DANSS [9], NEOS [10], Stereo [11], PROSPECT [12] and NEUTRINO-4 [13] report significant exclusions of the allowed parameter space established in reference [2]. A recent global analysis, in which the normalisations of the antineutrino fluxes for the fission isotopes were only weakly constrained, suggests the existence of a sterile neutrino with sin 2 (2θ ee ) = 0.035, ∆m 2 41 = 1.29 eV 2 at 2.9 standard deviations C.L. [14]. This result is driven by the NEOS and DANSS data. The PROSPECT, NEUTRINO-4 and Stereo data were not included yet in this study. Using the rate predictions in the global analysis, the significance increases to more than 3 standard deviations. However, in a recent analysis update of the DANSS experiment [15] -now including systematic uncertainties -the significance of their best-fit point at ∼ 1.3 eV 2 reduced remarkably and the agreement with the NEOS best-fit diminished. Results from NEOS rely on a comparison of their data with the measured spectrum from Daya Bay, which involves a non-trivial prediction-dependent conversion between different reactor and detector types [16]. In contrast, NEUTRINO-4 observes an oscillation effect with sin 2 (2θ ee ) = 0.39, ∆m 2 41 = 7.3 eV 2 at 2.8 standard deviations C.L. from their data alone. Such an oscillation is, however, excluded by the global analysis [14]. In particular, a value of sin 2 (2θ ee ) = 0.39 would induce a mean deficit of about 20 % of the antineutrino flux detected by reactor experiments, in conflict with data. The value is also in stronger tension with constraints from cosmology than parameter combinations with lower mass splitting [17]. How these tensions will be resolved is yet unclear.
In this context, we present new results from the Stereo experiment. In 2018, Stereo reported the exclusion of a significant part of the allowed parameter space [11]. There, we also excluded the benchmark point at 97.5 % C.L. using 66 days of reactor-on data. In the current article, we report improved results using 179 days of reactor-on data. In the following, all steps of the analysis chain are detailed. Several of these steps have been largely improved with respect to the previous version of the analysis presented in [11,18]. In Section II, we describe the experimental setup. Section III details the extended dataset while Section IV and V discuss the prediction of the antineutrino spectrum of the reactor and the simulation framework of the Stereo experiment, respectively. After presenting the determination of the energy scale of the detector in Section VI, we give the event selection in Section VII, followed by a description of detailed corrections of our selection efficiency in Section VIII. Section IX illustrates experimental backgrounds in the Stereo detector. They are exploited in a discrimination technique regarding the time shape of scintillation pulses to extract antineutrino candidates in Section X. Finally, we present the current improved results of our oscillation analysis in Section XI before concluding in Section XII.

II. EXPERIMENTAL SETUP
The Stereo experiment is located at the ILL research centre in Grenoble, France. Alongside the Stereo antineutrino detector, the ILL accommodates about 40 different instruments for neutron scattering, nuclear spectroscopy and fundamental physics with neutrons, taking advantage of the high neutron flux produced by its nuclear reactor RHF (Réacteur à Haut Flux).

A. The ILL Reactor Site
The RHF [19] is a heavy-water moderated reactor of 58.3 MW nominal thermal power. Nominal reactor cycles are 45 days long and alternate with shutdown periods for maintenance and exchanging the fuel element. There can be up to 4.5 cycles per year. The single fuel element consists of 280 curved aluminium plates enclosing the fuel, arranged in a hollow cylinder of 41 cm outer and 26 cm inner diameter by 81 cm height (for further details see [20]). Its compactness makes it suitable for searches of short baseline oscillations. The fuel is 93 % enriched in 235 U rendering the contribution of 239 Pu to the total number of fissions minuscule (cf. Section IV). The thermal power is monitored by observing the balance of enthalpy at the primary cooling circuit. Data taking of Stereo falls in a period of increased reactor maintenance, reducing the number of cycles to typically 2 to 3 per year. To increase the lengths of these cycles, the reactor was operated below nominal power. In general, the power lies between 50 MW and 56 MW and is constant for the respective cycle.
The Stereo detector is located on level C of the reactor building which is at the height of the fuel element. Its neighbouring instruments D19 [21] and IN20 [22] extract intense neutron beams from the reactor moderator (cf. Figure 1). They impose rays of γ and neutrons in varying rates on the Stereo detector, as well as stray magnetic fields of up to ∼ 1 mT [23] from high-field magnets (up to 15 T at the sample position inside the instruments). The shielding of the Stereo site was reinforced before the installation of the detector [18] and the Stereo detector is equipped with additional shielding as detailed in Section II B. The detector is partly covered by the transfer channel of the reactor (serving for example for the storage of used fuel elements). This channel is filled with light water and reduces cosmic-induced radiation at the Stereo site, yielding an inclination angle dependent shielding of about 15 metre water equivalent (m.w.e.) on average. The Stereo detector is aligned with this channel, resulting in an angle of (17.9±0.2) • between the detector axis and the direction to the core. The centre of the active detector volume is (10.298 ± 0.28) m away from the centre of the reactor core. This distance includes a vertical distance of 0.21 m since the centre of the active detector volume is located below the centre of the reactor core.  Figure 2.

B. The Stereo Detector
The Stereo detector uses the inverse beta decay (IBD) reactionν e + p → e + + n to detectν e within a 1.6ton gadolinium-loaded liquid scintillator target divided into six optically isolated cells [18]. The liquid scintillator is a LAB-based three-solvent-system using PPO and bis-MSB as fluors while gadolinium (Gd) is soluted in form of a Gd-beta-diketone complex [24]. The IBD identification exploits the time structure of the IBD event: a prompt signal due to e + energy loss and subsequent annihilation with e − followed by a neutron capture on a Gd nucleus. The neutron thermalises in the scintillator and is afterwards captured with a characteristic capture time of 16 µs. The average time of the full process including thermalisation is 18 µs. The neutron capture comes with a characteristic gamma cascade of about 8 MeV total energy. Due to the kinematics of the reaction, the prompt signal carries theν e energy information: E e + = Eν e − ∆M + m e = Eν e − 0.782 MeV where ∆M is the mass difference between the neutron and the proton and m e is the electron mass. The kinetic energy of the neutron is negligible.
The segmentation of the fiducial volume, designated as "Target" (TG), allows spectrum measurements at six different baselines between 9.4 and 11.2 m (cf. Figure 1). Each cell is 369 mm thick, 918 mm high and 892 mm 4 large. The TG volume, delimited by an acrylic aquarium with 12 mm thick walls, is enclosed in a larger volume defining an outer crown around the TG aquarium designated as "Gamma-Catcher" (GC) (cf. Figure 2). The active volume of the GC has boundaries at L×l×h = 3.096 m×1.536 m×1.505 m and consists of unloaded liquid scintillator of similar composition as the TG scintillator [24]. This crown allows to detect escaping γ-rays from events generated in the TG (511 keV γ-rays from the e + annihilation or γ-rays from the n-Gd cascade) and serves also as an active veto against external background entering the TG. The GC is optically segmented in 4 cells, as illustrated in Figure 2. The two cells (GC-Front and GC-Back) in front of the first and behind the sixth TG cell have the same geometry as the TG cells in order to reduce edge effects in the detector response. Separation walls within the same liquid are built with VM2000 TM enclosed in an air-filled gap between two thin acrylic plates (2 mm). In the same way, aquarium walls, side walls, and the bottom plate are made highly reflective using also a layer of VM2000 TM . Imperfect optical separations allow for optical cross-talk between cells of 2 to 11 %.
The scintillation light is readout by 48 8-inch photomultiplier tubes (PMTs), 4 for each TG cell and short GC cell and 8 for each long GC cell. They are located on top of each cell and separated from the scintillator by thick acrylics blocks (20 cm) designated as "Buffers" (cf. Figure 2). The optical coupling between PMTs and acrylic is provided by a bath of mineral oil [24].
PMT signals are continuously digitised at 250 mega samples per second using 14 bit ADCs. A dedicated electronic system [25] was designed for the Stereo experiment in order to trigger, process, and readout the PMT signals. Observation of a signal above the first level trigger threshold (∼ 300 keV energy deposition in one cell) triggers the processing of the event: a constant fraction discriminator (CFD) algorithm is used to estimate the start time of the pulse of each individual PMT, not only the triggered ones. Then, individual pulses are integrated to determine the amount of light observed by each PMT. The total and the tail pulse integral, Q tot and Q tail , respectively, are the result of a Riemann-integration over N tot samples (total pulse duration) and N tail samples (pulse tail duration). Typically 60 samples (240 ns) are enough to fully contain a pulse. Q tot is needed for the energy measurement while Q tail allows to built a metric for particle identification by pulse shape discrimination (PSD), effectively separating interactions from strongly and lightly ionising charged particles. Improvement of the PSD settings implemented in November 2017 resulted in a better separation between electronic (IBD-like) and proton (background-like) recoil components as compared to reference [11]. In standard acquisition mode, only the processed data (CFD time, Q tot and Q tail ) are recorded for all channels. This allows for a very high instantaneous trigger rate and a negligible fraction of dead-time which is typically less than 0.02 %. The number of TG protons is derived from a measurement of the scintillator mass and its hydrogen fraction. The scintillator mass was determined during detector filling by comparing the masses of full and emptied scintillator barrels. The hydrogen fraction was determined by CHN element analysis (combustion analysis) of a scintillator sample [24]. The obtained value is (1.090 ± 0.011) · 10 29 TG protons. However, as the oscillation analysis presented in this article is not using a comparison with an absolute spectrum, the uncertainty of the proton number does not propagate to it. It is a cell-to-cell correlated uncertainty that becomes relevant for absolute rate measurements only.
In addition to the overall fiducial TG volume, it is also relevant to know the volume of individual cells. Exact cell dimensions and associated uncertainties could be measured for the empty detector before filling. From the Cast3M software package [26] which implements a finite element method, the deformation of TG cell walls due to few millimetres static difference in liquid levels were ex-amined. The cell volumes were found to be 301.7 with a relative uncertainty of 0.83 %. This uncertainty enters the oscillation analysis, as it is considered cell-to-cell uncorrelated (cf. Table II).
The detector light response and stability are continuously monitored. A light injection system is used to calibrate the PMTs at the photo-electron level and to monitor the linearity of the electronics. A set of radioactive γ-ray and neutron sources ( 68 Ge, 137 Cs, 54 Mn, 65 Zn, 60 Co, 42 K, 24 Na, 241 Am/ 9 Be) is also regularly deployed inside and all around the detector to monitor the detector response and determine the energy scale. Sources can be placed via three different calibration systems either at any height in the TG cells through vertical Teflon-coated steel tubes located approximately at the middle of each cell (except cell 3 where a filling tube made from pure Teflon is installed) or at any position inside the shielding but outside the detector vessel along the perimeter of the detector or at any point below the central long axis of the detector (cf. Figure 2 and Table I).
In the first period of the data taking called "phase-I" (cf. Section III), several deficiencies of the acrylics impacted the response of the detector. The most critical one was the loss of the oil-bath on top of two buffers (the one in the fourth TG cell and the one in the GCfront cell) due to leaks in the buffer aquariums. This resulted in a drop of the collected light for these cells by a factor 2.5 compared to the other cells. Most of the optically separating walls lost their tightness allowing the LS to fill the air gap around the VM2000 TM layer. The optical cross-talk between cells increased typically from 5 % to 15 %. In the second phase of data taking starting mid-2017 ("phase-II", cf. Section III), light cross-talk was significantly reduced after maintenance on acrylic walls and buffers. Optical cross-talk between TG cells is now at the design value with an average value of 6.9 % (9.6 % for cell 3 to 4) and is stable within ± 7 % (relative deviations) over 16 months, except for cross-talk between cell 3 to 4 which has increased from 6.4 % to 10.2 %. The aquarium walls between TG and GC could not be repaired, so the average value from the first and sixth TG cell to the neighbouring short GC cell is 10.5 % and the average value from each TG cell to each long GC cell is 2 %.
To reduce γ-ray and neutron background, the Stereo detector is enclosed in a passive shielding of about 65 tons (cf. Figure 2) composed of borated polyethylene (29.7 cm on top, 14.7 cm on sides, and 20 cm below) and lead (15 cm on top, 10 cm on sides, and 20 cm below). A water Cherenkov detector used as an active muon veto is placed on top of the lead shielding. It shows a detection efficiency for vertical muons greater than 99.5 %, with a very good stability: observed variations are up to 0.4 % in phase-I and, after the exchange of failing PMT bases, stay within 0.1 % in phase-II. Overall, cuts on the veto (cf. Section VII) reduce correlated background by about 30 % (geometric efficiency). A magnetic shielding composed of several layers of different materials protects the TABLE I. Coordinates of source deployment positions in the internal calibration tubes (top), selected external calibration positions along the X-axis (middle), and selected underneath calibration positions along the X-axis (bottom). The vertical distances of the internal positions to the bottom of the tubes is given as ∆Z. The bottom of each tube has a distance of 2.5 cm to the bottom of the cells. Coordinates marked with an asterisk ( * ) can be chosen freely, the others are given by the detector geometry. See Figure 2 for the definition of the cell numbers and coordinate system. Underneath Calibration System Cell X * /cm Y /cm Z/cm 1  -93  3  -10  2  -56  3  -10  3  -19  3  -10  4  19  3  -10  5  56  3  -10  6  93  3  -10 Stereo system against the effect of the magnetic field generated on the Stereo site. From outside to inside, a soft iron layer surrounds the lead shielding and the muon veto detector, a mu-metal layer is inserted between the polyethylene and the Stereo detector (cf. Figure 2) and cylinders of mu-metal are placed around PMTs. The soft iron layer is covered by boron-loaded rubber (containing B 4 C powder) absorbing the ambient thermal neutrons present in the reactor hall.

III. DATASET
The data collection considered here includes a total of 179 days of reactor-on data and 235 days of reactoroff data [27][28][29][30][31][32][33]. As seen in Figure 3, reactor-on and reactor-off phases alternate, allowing for measurements of background-pure samples roughly about every two months. The first 24 h following a large transient of the reactor power are removed from the dataset to mitigate the corrections to be applied to the prediction of the spec-trum shape (see Section IV).
In March 2017, the Stereo detector had to be retracted from its measuring position to allow for maintenance work at the reactor. In particular, the throughgoing beam tube H6/H7 including its Stereo beamstop was removed (cf. Figures 1 and 5). The Stereo collaboration took advantage of this period to repair defective acrylics and perform maintenance work on the detector. As light collection, cross-talk, and potentially reactor background conditions have changed, data recorded before and after the repair are treated separately in the analysis. The dataset until March 2017 is labelled "phase-I" while the dataset recorded after the re-commissioning is labelled "phase-II".
As mentioned in Section II B, a change in the PSD settings occured during the first reactor-off period of phase-II. To be consistent with reactor-on data, data recorded before the change (37 days) are removed from the dataset of phase-II, but retained for crosschecking between the two phases. This gives a total of 118.5 (60.8) days of reactor-on data and 211.2 (24.0) days of reactor-off data in phase-II (phase-I).
During most long reactor shutdowns, the water level in the reactor pool above the core is reduced from its nominal value of 15 m to 7 m, resulting in an increase of the detected muons in Stereo of about 2.5 %. To study this effect on IBD candidates, the reactor-off dataset includes 61 (21) days of data recorded with a full pool in phase-II (phase-I). In contrast to the pool, the water level in the channel above the detector did not change (cf. Figure 1).

IV. REACTOR PREDICTION
In the oscillation analysis described in Section XI, the non-oscillated spectrum shape common to all cells is let free in the fit, suppressing the sensitivity to the initial prediction of the antineutrino spectrum emitted by the reactor. However, a precise estimate of the expected spectrum lays the basis for absolute measurements, to be discussed in future publications. The evolution of the fuel content of the ILL core during a cycle at nominal full power was first simulated using the FISPACT code [34]. Due to the high enrichment in 235 U of the fuel (93 %) very little plutonium is produced during an ILL-cycle, the mean fission fraction of 239 Pu was found to be 0.7 % only. This contribution has virtually no impact on the predicted emitted antineutrino spectrum. Therefore the Huber model of a pure 235 U spectrum [8] is used. Few corrections must be applied to predict the spectrum expected in the Stereo detector [35]»???«.
The Huber model provides a snapshot of the fission spectrum after only 12 h of irradiation of a 235 U target. Thus, the accumulation of the fission products with lifetime comparable to or larger than 12 h must be estimated to extrapolate to a typical 50-day long ILL cycle. This effect is called the "off-equilibrium" correction. After a reactor stop, the spent fuel is stored at specific locations in the water channel above the Stereo detector. The same long-lived fission products present in the spent fuel keep emitting antineutrinos that contribute to the spectra measured during the reactor-on and reactor-off periods. These two corrections have been calculated by coupling the isotopic inventory predicted by FISPACT along the reactor history with the BESTIOLE code [36] providing the antineutrino spectrum of each isotope. They are displayed in Figure 4 as the mean relative correction factor δ to be added to the Huber model of the antineutrino energy spectrum for a 50-day cycle at nominal power: As expected, the corrections affect the lowest energy part of the antineutrino spectrum. The visible structure in energy is attributed to dominant long-lived fission products 144 Pr, 92 Y, and 106 Rh with end-points reaching 3 to 3.5 MeV along their respective decay chains. Their amplitude is smaller than typical off-equilibrium and spent fuel effects computed for commercial reactors due to the short duration of the ILL cycles, which prevent significant accumulation of very long-lived isotopes. In the Stereo analysis, the prompt energy threshold (equivalent to E ν > 2. It does not include uncertainties in the amount of aluminium. The source volume effect was taken into account by calculating the geometrical solid angle to the detector from each emission point. Compared to the source volume effect of antineutrinos produced by fission, the correction amounts to (+1.0 ± 0.3) % leading to an effective 27 Al capture rate of (0.346 ± 0.001) per fission. The β − -decay of 28 Al produces antineutrinos with energies as high as 2.86 MeV leading to a mean correction of −10.0 % in the first 500 keV energy bin of the oscillation analysis. The next contribution, yielding a −1.0 % correction, is from the capture on 55 Mn. We attribute a 5 % uncertainty to the total correction, accounting for tolerances in the structural elements as well as in the amount of manganese in the aluminium alloys. All other (n,γ) reactions lead to either stable nuclei or antineutrinos below threshold. The initial prediction of the spectrum shape used in the following is then taken from the simulation of about 10 million IBD events weighted according to the Huber model for pure 235 U. It is corrected for the three abovementioned effects. Since the antineutrinos from 28 Al and 56 Mn have a similar origin in space as those from the core (including off-equilibrium antineutrinos) and since the antineutrinos from spent fuel are negligible, we apply the oscillation model to the full corrected spectrum.

V. MONTE CARLO SIMULATIONS
The Monte Carlo simulation framework of Stereo is built using the Geant4 toolkit (version 10.2) [37,38], and it is based on generic libraries for liquid scintillator detectors GLG4Sim, tested and developed in experiments like KamLAND and Double Chooz [39,40]. The simulation includes all aspects of an event from particle interactions in the detector to light collection in the PMTs. It is implemented such that MC output has the same format as real data. At this point, real data and simulation can be processed analogously during the analysis. For this purpose, the entire geometry of the detector has been implemented, including all the mechanical parts, photomultipliers, shielding components, the Cherenkov-veto, the calibration systems, and the main reactor building elements. The latter is mostly relevant for the simulation of the cosmic background, making use of the CRY libraries [41].
Special interest has been put into replicating the scintillation process in the simulation. Several intrinsic properties of the liquid scintillator have been tested in dedicated laboratory measurements [24]. These, together with information from calibration data, have been used to fine-tune the scintillator parameters in the simulation of the Stereo experiment. Some of these variables are the quantum yield of the fluors, the total attenuation length and light yields of both liquid scintillators. The quantum efficiency of the PMTs has also been accurately described in the simulation using measurements from reference [42].
In the simulation, the TG proton number is calculated from the vessel volumes, the densities of the solvents and the chemical composition of all the scintillator components. The linear alkyl benzene (LAB) as main compo- nent (73 wt.%) has not a well defined molecular formula, since it is a mixture with hydrocarbon chains of different lengths. Therefore, approximated values are used for the number of hydrogen and carbon atoms per molecule in the MC to match the average molecular mass as specified by the supplier. The scintillator density is calculated from the individual densities of the solvents which could also introduce a slight bias in the TG proton number of the MC. This is mainly due to the small contributions of the wavelength shifters and the Gd-complex, which are not taken into account. Overall, the calculated TG proton number in the simulation is less precise than the one obtained from the mass and hydrogen fraction measurements as explained in Section II B. Therefore, a correction factor of 0.983 ± 0.010 on the normalisation of the simulated antineutrino events is applied for the scintillator below the acrylic buffers (height 91.8 cm). For the small volumes at the side of the acrylic buffers, average values of the liquid levels in phase-I and phase-II are used. The associated uncertainty related to this subvolume can be neglected. Since, in Section VII, we find the fraction of selected IBD candidate events outside of the TG volume to be much smaller than 1 %, we can neglect volumes other than the TG in the calculation of the proton number. The bulk properties of reflectivity and transmission of the separative plates were also carefully treated. They impact directly the collection of light, the top-bottom asymmetries, and thus the energy resolution. As explained earlier, the walls are made of two acrylic plates with, enclosed between them, one or two sheets of VM2000 TM mirror films and a nylon mesh that ensures an air gap between the plates. VM2000 TM mirrors (current name is Enhanced Specular Reflector -ESR) are multilayer films made of highly birefringent polymers and were developed by 3M [43]. They exhibit a high reflectivity in a large range of wavelengths at any incident angle in air. The reflectivity of the film used for Stereo was measured with a spectrophotometer to be above 98 % in the 400 -950 nm range in air at an angle of incidence of 7 • . As stated in Section II B, liquid scintillator entering and filling the air gap of the separation walls degraded the reflective capabilities of the ESR film, resulting in increased light cross-talks. Indeed, the reflectivity of an ESR film in a liquid strongly decreases at large angles [44]. The transmission was e.g. measured with a spectrophotometer to increase up to 50 % for angles of incidence larger than 75 • at 450 nm when the VM2000 TM was placed in the liquid scintillator. To account for this scenario, a versatile description of light cross-talk between cells was implemented in the simulation of Stereo via the introduction of individual filling levels for all acrylic walls (side walls, separation walls, bottom plate) and a model describing the effect of the liquid on the reflectivity of the ESR film. The addition of absorption effects were found to be necessary to describe accurately the widths of the calibration peaks and to reduce the differences of top-bottom asymmetries between data and MC. They were implemented via two terms: absorption of part of the light transmitted through the ESR films due to the shadow of the mesh, and reduced reflectivity of the Teflon-coated calibration tubes (considered as perfect diffusers in earlier analyses). A separate tuning of these two degenerated parameters was made possible by the fact that cell 3 has no calibration tube, but a filling tube instead, made of highly reflective bulk white Teflon.
The main parameters of the model (levels of liquid in the wall, absorption terms, transmission at large angle of incidence in VM2000 TM ) were optimised in a global approach to reproduce the measured light leaks and col-lected charges for sources at different height. 24 Na and 54 Mn calibration sources deployed at five different positions within each of the calibration tubes were used for this purpose. The well-localised vertices of the 834 keV gamma rays from 54 Mn decays and the coincident gamma rays at 1368 and 2754 keV from 24 Na decays provide complementary information. Examples of agreement on the collected charge between data and simulation are shown in Figure 6, when the 24 Na source is placed inside the cell or in a neighbouring cell. The collection of light is well reproduced in both cases. When the source is in a neighbouring cell, the prominent structure at low charge was used to gain sensitivity to light leaks and to set the liquid level parameter for the considered separating wall. In general, better agreement is observed for the middle source positions rather than for top or bottom positions.
To quantify the residual discrepancies between simulated and measured spectra, an homogeneous scaling factor, denoted as dilatation factor, is applied to the simulated charges such that the simulated spectrum matches best the measured spectrum. Figure 7 shows an overview of the dilatation factors obtained for each of the 25 measured positions of the manganese calibration source (cell 1, 2, 4, 5, 6 and height 10, 30, 45, 60, 80 cm from the bottom of the calibration tubes, cf. Table I). Most factors are within a ± 1 % band around 1 illustrating the very good agreement at the raw charge level, before considering any energy reconstruction. Two outliers at the 2-3 % level correspond to the 10 cm position in cells 4 and 5. The origin of these could be due to a local effect of the bottom plate in these cells, to be investigated in a future version of the simulation. This fine tuning of the optical parameters was performed by comparing the simulation with the data of 28 th April 2018, corresponding to about the middle of phase-II. It is the purpose of the energy reconstruction to correct to first order the residual time evolution of the light collection in the detector, mitigated during phase-II (see Section VI).
Lastly, additional attention was given to the simulation of neutron mobility and gammas released by their capture by Gd nuclei. In particular, the refinement of the description of thermal scattering by including molecular data for hydrogen improved the agreement between the simulated and observed capture times using an AmBe source. Moreover, the description of the de-excitation cascades of the relevant Gd isotopes 156 Gd and 158 Gd were significantly improved, after making use of the FIFRELIN code [45]. Since the Gd level schemes are not experimentally known up to the energies of the relevant excited states after thermal neutron captures, models are required to describe the de-excitations of the formed nuclei. In the FIFRELIN simulation, a set of nuclear level schemes is sampled from the most updated experimental data and the recommended nuclear models and data evaluations. De-excitation processes are then generated within a Monte-Carlo Hauser-Feschbach framework. This approach allows to take into account nuclear structure uncertainties, and provides an accurate description of the energy and multiplicity distributions of the de-excitation products. An high-statistic sample of de-excitation products generated from FIFRELIN is used in the Stereo simulation on an event-by-event basis for each cascade following a neutron capture on the 155 Gd or 157 Gd isotopes [46,47].

VI. DETECTOR RESPONSE
A good understanding of the energy scale is an essential prerequisite for the analysis presented in this article. Because of the steep variations of the detected IBD spectrum the study of its shape is extremely sensitive to the control of the energy scale [48]. Based on the comparison between identical cells, the search for the oscillation signal presented here mitigates the impact of an overall bias at the detector level, but the uncorrelated effects between cells must be accurately estimated. The strategy for studying the detector response for phase-II of the experiment remains similar to that described in reference [18] for phase-I. However, between the two phases the detector was emptied and the acrylics repaired. By this, the light collection properties including the light crosstalk between cells, a dominating effect in the description of the detector response of phase-I, have changed substantially. Moreover, the Monte Carlo was re-tuned to new optical parameters. The systematic uncertainties of the energy reconstruction are therefore considered to be essentially independent between the two phases.

A. Non-Linearities
The detector response deviates from a perfect linear model due to two main causes. Quenching for high dE/dx, intrinsic to the liquid scintillator, leads to a non-linear response for low energy deposits. This phenomenon is described by an effective Birks coefficient k B [49]. Moreover, Cherenkov-radiation contributes to the non-linearity. In addition to the non-linearities, there is a complex collection of light in the Stereo detector: Norm. Calib. Coefficient the attenuation length in the liquid induces a dependence on the vertex and, although stabilised, light cross-talks are still present between cells.
To disentangle the various effects, point-like radioactive sources are deployed in the internal calibration tubes to study the quenching effect. These measurements are performed during periods where the reactor is turned off to minimise environmental background, in particular from 41 Ar present in the reactor building. The light cross-talks are estimated by looking at the ratio of the charge collected in a neighbour cell to the charge collected in the calibration cell. This analysis is performed using 54 Mn source runs. These cross-talk values are then used to select full energy deposits in the calibration cell by limiting the charge detected in the neighbour cells. The cell studied is taken as the one next to the source cell to have a better separation of the gammas in the case of multigamma sources. In addition, for multi-gamma sources, a deposited energy, corresponding to the second gamma, is required in the gate cell, e.g. cell 6, which is opposite to the calibration cell, e.g. cell 4, with respect to the source cell, e.g. cell 5 (cf. Figure 2). The light cross-talks between the gate cell and the calibration cell are negligible. The same method is used to separate the neutron and the gamma of the AmBe source. Identical cuts are applied to the measured and simulated events. The position of the charge peaks thus formed is then determined. The events with a charge included in a determined range around the positions are selected. For measured and simulated events, the mean deposited charge is taken as the  average of the charges of these events. The mean true deposited energy, available in simulations only, is also determined by an average over the selected events. The calibration coefficient for data and for MC is calculated for each source as the ratio of the respective mean charge over the mean deposited true energy. To compare the results in different cells, the coefficients are normalised to the one of the 54 Mn source, chosen as the calibration anchor point of the experiment (see Section VI B). As the normalised coefficients agree within their uncertainties across cells, it is then possible to average them over the 6 cells. For the special case of the AmBe 4.4 MeV gamma, the average is taken only over 4 cells. This is due to the fact that the calibration tubes are not centered in the cells along the X-axis (cf. Figure 2). Therefore, in case the tube in e.g. cell 5 is used, the number of neutrons reaching the further cell, i.e. cell 6, is too little. Thus, only the closer cell, i.e. cell 4, can be used as gate cell. This prevents calibration if cells 1 and 4 are the cells of interest, because the neighbour cells 0 and 3 do not have a calibration tube (cf. Figure 2). The uncertainty takes into account the discrepancies obtained by varying the cuts on the deposited charge in the neighbour cells and the limits of the averaging range of the coefficients. The averaged coefficients in Figure 8 clearly show the expected quenching effect, responsible for a reduced light yield at low energy. After adjusting the effective Birks coefficient in the MC, the agreement between experimental and simulated curves reaches sub-% accuracy with k B = (0.096 ± 0.007) mm/MeV. Note that the quenching effect associated to positrons (prompt IBD signal) is smaller than for gammas. However, a good control of the reconstructed gamma energies is mandatory to study the systematic effects of the energy scale with radioactive sources and to accurately control the neutron detection  Table I). When two γ-rays are emitted, the energy used here is the sum. Data points are all contained in an uncertainty band of ± 1.0 % shown in grey. Uncertainties include effects from data statistics, as well as fit range and time dependence of the peaks, but no correlations across sources are assumed.
efficiency via the gamma cascades of Gd isotopes.

B. Calibration Anchor Point
As described in reference [18], conversion factors between collected PMT charge and light cross-talk between neighbouring cells on the one hand, and reconstructed energy on the other hand, are calculated by iteratively tuning these parameters to 54 Mn data, taken on a weekly basis. This source emits a single 835 keV γ-ray with an interaction length of ∼ 10 cm in the liquid scintillator, comparable to the dimensions of a cell along the X-axis (∼ 37 cm), but significantly smaller than the dimensions along Y and Z (both ∼ 90 cm). Thus, events attributed to a specific TG cell are mainly acquired when the 54 Mn source is located inside the same cell. However, a few events are also recorded when the source is located in a neighbouring TG cell. For the special case of cell 3, only events from the neighbouring cells are recorded (cf. Figure 2).
A set of parameters representative of the average response in the entire volume of the cell, the only quantity accessible with the formalism of energy reconstruction, is obtained by taking the average of the responses of five source positions distributed along the vertical calibration  tube (Z-direction, cf. Table I). As for phase-I, the experimental and simulated 54 Mn peaks can be aligned simultaneously in all cells with high precision (cf. Figure 9). The residual cell-to-cell fluctuations are reduced to the 0.2 % level and enter the oscillation analysis (cf. Table II).
The 54 Mn source was chosen to define the reference gamma energy, because among the sources with long halflife, it is the single-gamma source which has the highest gamma energy being reliably separable from gamma rays of 41 Ar decays. As 41 Ar concentrations in the reactor building vary during reactor-on periods, single-gamma sources with higher energies, like 65 Zn, do not constitute an applicable alternative. The choice of a reference gamma energy at 835 keV corresponds to a larger quenching effect than for the positrons of the IBD candidates, in the [1.625, 7.125] MeV range. Therefore, the reconstructed energies in that range are a few percent higher than the "true" values. There is no attempt to correct the data for this effect. Instead, all relevant effects are included in the MC and any model or true energy information to be compared with the data is forward folded with the simulated detector response. This avoids a troublesome unfolding procedure and guarantees the most accurate treatment of the data.

C. Reference Energies from Radioactive Sources
Once the parameters of the energy reconstruction are fitted on the 54 Mn data and the k B coefficient is optimised, different radioactive sources are studied in order  Table I).
Data points are all contained in an uncertainty band of ± 1 % shown in grey.
to verify the good quality of the agreement between data and MC over the widest possible range of energy and position. For each nominal energy, the reconstructed energy E rec and the resolution, defined as the ratio σ(E rec )/E rec , with σ(E rec ) the standard deviation to the right of the full energy peak, are reported. Figure 10 shows that the mean ratio E Data rec /E M C rec over the five vertical positions stays within ± 1 % for all sources. The resolution, shown in Figure 11, follows a 1/ √ E law at low energy and then saturates around 5 % at higher energy. This behaviour is reproduced by the simulation and is interpreted as the convolution of the pure statistical resolution with the variation of a few per cent of the collected light according to the altitude of the energy deposit in the cell.
The deployment of γ-sources in the internal calibration tubes only probes the detector response in the central region of each cell. At the cell border (Y -direction), additional effects might influence the detector response. To probe the border regions of the six cells, an AmBe source is deployed outside the detector vessel, but inside the shielding, via the external calibration system (cf. Figure 2). Per cell, six positions (two Y -positions and three Z-positions, cf. Table I) are investigated. All points are located at the centre of the cell (X-axis). The AmBe source emits neutrons which are able to traverse the GC cells that lie between the source and the TG cells. The agreement between data and MC for both sides and the centre of the detector is at the 1 %-level across all cells (cf. Figure 12). Since we find the same values when comparing the external calibration points with the internal points, we do not consider any additional correction or uncertainty for the reconstructed energy of events close to the cell borders.

D. Time Stability
To complete the study of systematic uncertainties, a set of events spatially distributed as similarly as possible to IBDs is ideally sought. A first sample of useful data is provided by the capture of neutrons from cosmic rays by hydrogen. The simulation of cosmic rays in the Stereo environment confirms the predominant role of the lead shielding in the generation of secondary neutrons. However, a significant fraction of these neutrons are absorbed by the materials arranged for this purpose around the TG volume: borated polyethylene, the GC and acrylic buffers. Remaining neutrons are those with high energy. Their capture vertices are widely distributed in the whole detector. The n-H capture peak is isolated in the data by looking for events around 2.2 MeV less than 300 µs after a muon signal in the veto or the detector. The central value and standard deviation of the reconstructed energy peak are extracted from the Gaussian core portion of a Crystal Ball fit function. Defining 20-day time bins, the detector response is found to be very stable at the 0.3 % level in each cell (cf. Figure 13). The remaining drifts are fully correlated across all cells. Therefore, this uncertainty on the time-stability is taken as estimate of the cell-to-cell correlated systematic uncertainty (cf . Table II).

E. Information from Continuous Energy Spectra
The residuals in Figures 10 and 12 show no other significant correlated systematic uncertainty to be considered. The cell-to-cell fluctuations are studied instead to determine the uncorrelated systematic uncertainty.
The simplest estimate would consist in setting a 1 % uncorrelated uncertainty on the calibration coefficients (cf. grey area in Figure 10). However, for a more robust analysis two steps beyond this approach were taken. First, higher order terms were allowed to fit the energy dependence of the residuals, and second, data complementary to the calibration sources were added in a combined fit of the energy scale in each cell. In principle, the absolute position of the 2.2 MeV peak from the data of cosmic n-H captures would be a good candidate to complement the sources, but testing the energy scale at the (sub-)% level requires a quite accurate control of the simulated distribution of vertices, which is still not demonstrated for these complex events induced by muon spallation events in structures outside the TG volume. At the current stage of the simulation, we estimate that the absolute position of the n-H capture peak induced by cosmic rays is affected by systematic uncertainties comparable to the small distortions we want to test.
Therefore, the analysis of another complementary source of calibration data has been developed: the βdecay spectrum of the 12 B isotope. This unstable nucleus is naturally produced by the exposure of the liquid scintillator to cosmic rays. Two dominant processes are involved with the 12 C atoms of the liquid as common target: spallation reactions and capture of negative muons stopped in the TG volume. The produced 12 B nuclei undergo β-decay with a half-life of 20.20 ms and a Q-value of 13.37 MeV, thus largely covering our energy range of interest.
The combination of a high muon rate in Stereo and the long lifetime of 12 B makes it challenging to select time-correlated pairs of events. To optimise the selection, the discrimination power of the muon capture process is exploited. Muon candidates are chosen in the [45,120] MeV energy range corresponding to a distribution of stopping points covering a fairly large fraction of the cell volume, but rejecting traversing muons and cosmic showers. The µ − capture process is further enhanced by rejecting pairs from decaying muons identified by their associated Michel-electron. Their characteristic signature is defined as a coincidence between a muon candidate followed by an event in the [5,70] MeV energy range within 6 µs. Then, the boron candidate is searched in the same TG cell and in a [2,35] ms time window after the muon candidate. The spallation neutrons are an important source of fake 12 B events through their capture in the detector. This contribution is mitigated by requiring no muon event in 200 µs before the 12 B candidate. The remaining background is subtracted by looking for accidental coincidences in ten off-time windows (cf. Section IX A). That way, 793.2 ± 3.5 boron candidates are  Here, the location of stopping muons is relying on simpler physics of muon energy losses. Muon angular and energy distributions are well known at the surface level and have been cross-checked by on-site measurements [18].
As for the radioactive sources, the quality of the energy scale is determined by comparing this experimental spectrum to the simulation. In the 12 B MC event generator, the β-transitions to the ground state of 12 C (98.22 % branching ratio) and to the first two excited states have been included with their associated γ and α particle emissions, accounting for a total branching ratio of 99.94 %. These three transitions are of allowed type with well-known endpoints. Thus, the theoretical uncertainties on their spectrum shape are dominated by the following correction terms to the Fermi theory: weak magnetism and radiative corrections. For both terms, the 1σ uncertainty band is obtained by comparing simulated spectra with different scenarios of these corrections. The expression of the weak magnetism correction is varied between the original approximate formula in [50] and a more recent calculation [51]. In a recent publication of Daya Bay [52], it is argued that radiative corrections should not be applied to the simulated 12 B spectrum because the virtual loop corrections do not change the detected kinematics and the energy taken by the emitted real photons is actually deposited in the liquid scintillator. In the case of the smaller Stereo detector, real photons could escape the active volume and their 1/E probability density also induces some quenching. Thus, the variations of the spectrum simulated with or without the radiatives corrections were taken as a ± 1 σ envelope around a mean spectrum, defining the reference. Another source of uncertainty is the distribution of muon stopping points given as input to the event generator. The nominal X, Y , and Z-distributions were taken from the muon simulation of the Stereo experiment, applying the same energy and topological cuts as in the data analysis. Mild dependences on X and Y are obtained while a clear correlation between Z and the selected prompt energy shows up, as expected. The configurations with all vertices at the centre of the cells and with uniform vertices in the whole volume were also simulated to test the sensitivity of the predicted spectrum shape. As these are extreme cases, the associated uncertainty band was taken as a ± 2 σ level. All the above systematic effects are well fitted by linear distortions of the simulated 12 B spectra and the uncertainties are propagated using a covariance matrix formalism. The uncertainties associated with the radiative corrections and the vertex distribution are dominant and comparable in magnitude.
Finally, a possible contribution of 12 N β-decays was investigated. This isotope can be produced from 12 C nuclei either by charged-current interaction of a µ + on 12 C or charge exchange after proton production by a spallation process (although this last process should be suppressed by our selection cuts). The half-life of 12 N is 11 ms with an endpoint of 17.38 MeV, thus passing the boron selection cuts. The higher endpoint provides some sensitivity to this contribution and the experimental spectrum is fitted with a weighted sum of the two β-emitters. The fitted contribution of 12 N is found to be negligible as it amounts to (1.2 ± 0.3) % of the integral of the total spectrum. A reasonable agreement between the measured and simulated shapes is obtained (cf. Figure 14), with a χ 2 /ndf = 40/25 corresponding to a p-value of 0.03.

F. Combined Analysis of all Calibration Data
Separately in each cell, a combined fit of the boron spectrum (cf. Section VI E) and the radioactive source data (cf. Section VI C) with a distortion of the energy scale is performed. The residuals of radioactive sources are directly probing the discrepancies between the experimental and simulated energy scales. In the case of the continuous boron spectrum, the impact of a distortion of the energy scale is propagated following the formalism described in reference [48]. Polynomial functions of different degrees (between 1 and 4) are fitted to the combined data to probe possible energy non-linearity models. A typical example of a combined fit by a second-order polynomial is illustrated in Figure 15. The corresponding distortions of the detected IBD spectra are again calculated with the formalism of reference [48] and are depicted in Figure 16. We find the fit results of cells 1 to 5 in better agreement among each other as compared to cell 6. Depending on the cell and the model, the amplitude of the fitted distortion varies, but its shape remains similar for all fits and is always contained within the abovementioned simple model of ± 1 % uncertainty on the linear calibration coefficient. Therefore, we take this envelope, mostly determined by the deviation of cell 6 (−1 %), as a 1 σ uncertainty with 100 % energy-bin-to-energy-bin correlation and no cell-to-cell correlation. Testing the impact of a cell-to-cell uncorrelated bias at the level of ± 1 % in a pseudo-experiment, we find the results of the oscillation analysis to be stable. This study completes the list of energy-scale-related uncertainties used for the oscillation analysis, reported in Table II.

VII. EVENT SELECTION
The first step of background discrimination is done by applying cuts. Each cut aims for an optimal compromise between the signal acceptance and background rejection by minimising the relative statistical uncertainty on the IBD rate. At the same time, cuts are chosen such that cut results are stable under small cut variations. Using an MC simulation of reactor antineutrinos and taking into account solid angle effects, the spectral distortion caused by a varying cut acceptance with energy is evaluated for each cut and each cell. Too strong distortions by energy or cell may fake or obscure patterns of neutrino oscillations.
All cuts are given in Table III. In the following, we discuss each cut individually. We report cut acceptances as a cut = N all cuts N all cuts w/o studied cut and we define the vertex cell of an event as the cell that exhibits the largest reconstructed energy. Cut #1 selects prompt IBD candidates based on the expected energy spectrum of reactor antineutrinos. We find an acceptance of 89.5 % of the lower cut at 1.625 MeV and 99.7 % of the upper cut at 8.125 MeV. Both cuts show the same acceptance for central cells, but a small discrepancy is observed for the outer cells (up to ∼ 0.2 % for cell 1). In the final extraction of IBD events described in Section X, only the energy bins up to 7.125 MeV are considered due to the low event statistics and low signalto-background ratio at higher energy. A cut at this energy has an acceptance of 98.5 %.
Cut #2 selects delayed neutron captures by Gd, releasing gamma rays with a total energy of ∼8 MeV. By this technique, the number of accidental coincidences, which are mainly caused by low energy events, is strongly reduced. The lower cut value of 4.5 MeV is optimised to include a wide part of the tail of the gamma cascade of the Gd events while avoiding energy depositions of neutron captures by hydrogen at ∼2.2 MeV. Hydrogen is mainly present in the detector as part of the organic scintillator. The lower cut has a signal acceptance of 76.0 % with a small difference for cells 1 and 6 of ∼0.6 %.
The rejected IBD events are mostly those where the neutron is captured by hydrogen. The upper cut value at 10 MeV is fixed to the end of the delayed spectrum in detected energy (see discussion in Section VI B). It has an acceptance of 99.9 %. While the upper cut shows a flat dependence on energy and cell number, the lower cut preferably removes events at low prompt energy leading to a 1.4 % evolution over the prompt energy cut range for all cells. Looking at the vertex distribution of rejected events, one finds that mainly events close to the bottom and top of the TG are rejected. Those events are likely to deposit some energy in inactive detector components, since the GC does not extend above and below the TG (cf. Figure 2). Since the correlated positrons have a similar spatial distribution, they more likely deposit only a part of their energy in the detector compared to events in the detector bulk, due to leaking annihilation gammas or positron trajectories. Thus, not applying this cut adds events where a part of the prompt energy was lost (cf. Equation 3). Therefore the formal acceptance of the cut is lower at low prompt energy than at high prompt energy. Cut #3 allows the rejection of background by exploiting the delay of the neutron capture in the TG due to thermalisation and the time constant of the capture process. For the Stereo TG scintillator this time constant is found to be 16 µs. The lower threshold at 2 µs allows the rejection of decays of stopping muons and has an acceptance of 98.0 %. The decay of stopping muons can mimic an IBD event when the energy of the end of the ionisation track of the muon falls into the prompt energy window and the energy of the produced Michel-electron falls into the delayed energy window. As the lifetime of muons is rather short at 2.2 µs, this background can be reduced by the lower time cut. The upper time cut is fixed to 70 µs where the contribution of accidental coincidences becomes dominant over physically correlated events and has an acceptance of 97.6 % for cells 2 to 5, 97.4 % for cell 1 and 96.7 % for cell 6. While the lower cut shows no effect on energy or cell number, the upper cut shows a 0.7 % effect which is slightly bigger (1.0 %) for cell 6. The different value for cell 6 can be explained when considering the directionality of incident antineutrinos that is propagated into the IBD neutrons. Cell 6 is the last cell before the GC (cf. Figure 1) and therefore there is a higher chance that neutrons have a trajectory that reaches the GC. Since the capture time constant is longer in the unloaded GC scintillator, it can take more time before the neutrons re-enter the TG and get captured.
Cut #4 exploits the fact that accidental coincidences are not correlated in space. Thus, they show on average a larger distance between prompt and delayed vertex as compared to an IBD event, where positron and neutron originate from the same interaction vertex. This cut has an acceptance of 99.3 % and shows negligible spectral distortions.
Cut #5 allows one of the 511 keV gammas of an IBDpositron annihilation to escape from the vertex cell. The two 511 keV gammas are unlikely to be detected in the same neighbour volume, and the margin between 511 keV and the 1 MeV cut is safe against the light cross-talk from the vertex cell to its neighbours. Cut #6 requires no energy deposit far away from the vertex cell. The acceptances of the cuts #5 and #6 are 98.6 % and 99.6 %, respectively. Cells 1 and 6 show a 0.3 %-points higher acceptance, mainly because other cuts already removed events at the border to the GC and those cells are mainly surrounded by GC cells. The energy distortion of these cuts is 2.9 % and 0.4 %, respectively, over the full energy range. It is caused e.g. by fluctuations in the energy transferred to the neighbour cell due to energy cross-talks being larger at higher prompt energies.
Cut #7 allows to reject high energy background events coming from the sides into the detector. Those contribute mainly to accidental coincidences. The cut accepts 98.8 % of IBD events for cells 2 to 5, but shows a lower acceptance (96.0 %) for cells 1 and 6 since one of their neighbour cells is a short GC cell instead of a TG cell (cf. Figure 2). The energy distortion is at the 0.3 %-level for central cells and 1.0 % for cells 1 and 6 for similar reasons as for cut #2.
Cuts #8 and #9 introduce a veto-time after a muon candidate is detected by its deposited energy in either the water Cherenkov veto or the detector. A muon candidate in the detector is an event with a total energy above 20 MeV. Cuts #8 and #9 reject a large part of muon-induced background such as fast neutrons which can mimic an IBD event by a prompt proton recoil and a subsequent neutron capture of the same neutron or an- other neutron induced by the same muon. Those events are close in time to the incident muon. Likewise, cut #10 allows to reject correlated background events originating from a multiple neutron cascade. Those neutrons are expected to be close in time. The acceptance of these three cuts is directly corrected via the veto-time.
Cut #11 allows to reject stopping muons in addition to the lower time cut introduced by cut #3. To be not already rejected as muon event by cuts #8 or #9, a stopping muon needs to deposit only a small amount of energy in the detector, i.e. it can only have a short track in the detector. Consequently, it must have a vertex distribution very peaked at the top of the detector and hence the distribution of scintillation light is asymmetric compared to events in the bulk of the detector. By looking at the distribution of charge between the four PMTs of one cell, stopping muons can be discriminated as they are likely to produce most scintillation light in the vicinity of just one PMT. This leads to a higher ratio between the maximal charge of one PMT and the total charge of all PMTs than for bulk events. This cut has an acceptance of 99.3 % and shows a small energy distortion of 0.4 %.
Before any selection, we find the ratio of antineutrino interactions in MC to be 39:47:14 for TG, GC and acrylic detector components (mainly Buffers), respectively. The ratio is better than 99:0:1 after applying selection cuts. The overall acceptance in the TG after applying all cuts is 60 % showing an ∼ 4 % relative variation across the prompt energy range. Details can be seen in Figure 17. The central cells (2)(3)(4)(5) show very similar behaviour with an average of 63 % while cell 1 exhibits a few percent lower acceptance due to the loss of neutron detection efficiency at the edge of the detector (cf. Section VIII). Cell 6 is affected by the same effect, but has some additional neutron loss due to the directionality of incident antineu-  trinos that is propagated into the IBD neutrons. The resulting small energy distortion has a common shape for all cells.
In this study, the full set of cuts was compared with the set of cuts reduced by the cut of interest (cf. Equation 3) to get rid of the existing correlation between cuts. The systematic uncertainties of the resulting acceptances induced by the cuts #3 to #7 cannot be derived directly from the uncertainties of the used observables because of the large correlations of the cuts. For example, cuts #5 and #6 use the same cell energy reconstruction. Therefore, the systematic uncertainties were estimated for each cell and energy bin by an MC study. For this study, observables used for the selection of these cuts were randomly varied within their estimated uncertainties before being applied to the IBD MC, resulting in a new number of selected IBD candidates. By redoing the procedure multiple times, distributions of selected IBD events in each cell and each energy bin are produced. The resulting means of the IBD distributions are compared to the number of IBDs obtained without any variation and the differences give the resulting systematic uncertainty for each cell and energy bin. For the sum of all cells, the systematic uncertainties are of the order of 0.5 % below 4.375 MeV and increase up to 2.0 % at 7.875 MeV. As these uncertainties are cell-to-cell correlated, they do no enter the oscillation analysis, which is insensitive to common shifts among detector cells. However, they lay the basis for absolute measurements, to be discussed in future publications. Note that the systematic uncertainties do neither take into account energy scale systematics (cut #1 and #2) nor energy non-linearities which are taken into account separately in the oscillation analysis (Section XI). Moreover, they do not include the delayed neutron detection efficiency (Section VIII) which is also  taken into account separately in the oscillation analysis.

VIII. DELAYED NEUTRON DETECTION EFFICIENCY
As it was introduced in Section II B, antineutrinos are detected in Stereo by their interaction on a proton, followed by a positron annihilation and a delayed neutron capture. To suppress low energy background, mainly neutron captures by Gd are accepted (cut #2 in Section VII). To have a good understanding of the antineutrino interaction, it is necessary to know the efficiency of the selection of neutron captures by Gd nuclei. In this section, the estimation of this term and its implementation in the simulation framework of Stereo are explained.
The fraction of neutron captures by Gd in the detector is evaluated using a neutron source. To build a neutron detection efficiency map of the TG volume, the source is deployed through the internal calibration tubes into the TG cells at five different heights along the Zaxis of the detector (cf. Table I). In the Stereo experiment, an americium-beryllium ( 241 Am/ 9 Be) source is used as a neutron emitter. These particles are produced via 9 Be(α, n) 12 C reactions, where the α particle is created in radioactive decays of 241 Am. In about 60 % of the cases, the carbon nucleus is produced in an excited state, and emits a 4.4 MeV gamma in addition to the neutron [53]. To get a background-free neutron capture sample, this high energy gamma is used to do a delayed coincidence search with the neutron capture. The coincidence selection cuts are applied in a similar way as for the IBD: delayed signals are searched in a time window of 100 µs after a 4.4 MeV prompt candidate. Also, contributions from random coincidences (accidental back-ground) are selected in a similar way as for IBD events (cf. Section IX A). They are subtracted statistically. Figure 18 displays a neutron capture spectrum obtained after this coincidence search and accidental subtraction for an AmBe source. A good agreement between data samples and MC simulations have been reached due to the implementation of the FIFRELIN code [45], improving the description of the de-excitation cascades of the relevant Gd isotopes [46]. The spectra of coincidence times of neutron captures are represented for both, data and MC simulations in Figure 19 showing the agreement of the capture time between both samples.
The efficiency of the detection of neutron captures by Gd is estimated taking into account two different terms. The first is the Gd-fraction (ε Gd ), which takes into account the fraction of events captured by Gd with respect to other elements in the detector. This term depends on the abundance of these elements in the liquid scintillator and their corresponding neutron capture cross-sections. In the TG volume, captures by Gd and H dominate this competitive process. Neutron captures by Gd release a cascade of gammas with energies around 8 MeV, whereas captures by H are followed by a single gamma of 2.2 MeV. Figure 18 shows that the Gd-capture events (N Gd ) are mainly distributed with energies between 3 and 10 MeV (Gd-capture peak and its tail). H-captures (N H ) are distributed with energies in the range [1.5, 3.0] MeV, where the lower energy cut of 1.5 MeV is established to exclude the region with background [54]. Considering these energy cut ranges, the Gd-fraction is defined as However, this neutron capture fraction does not take into account all the selection cuts applied by the coincidence search of an IBD event (cf. Table III). The main selection cuts affecting the delayed events are the energy cut #2, the topological cut #7, and the coincidence time cut on the delayed event #3. In this way, the IBD selection cut efficiency is defined as the fraction of the Gd-capture events fulfilling all of these cuts. This yields the total delayed detection efficiency Neutron efficiency is evaluated in the same way for AmBe calibration data samples and MC simulations at the same source positions. The efficiency terms have been estimated in both samples, and compared to evaluate possible discrepancies caused by imperfections of the MC simulation.
The ratio between data and MC efficiency terms are named correction coefficients and are defined as where x ∈ {Gd, IBD} denotes the specific efficiency term. For the IBD selection cut efficiency, an averaged value of c IBD = 0.9985 ± 0.0059 has been obtained from all the calibration points inside the detector. The uncertainty includes systematic effects from time variations of the 25 detector calibration points in time, inhomogeneities, and displacements of the source position within the deployment accuracy. The averaged value is consistent with unity within one standard deviation.
The Gd-fraction correction coefficient c Gd is not constant throughout the TG volume mostly caused by an imperfect model of the mobility of thermal neutrons in the simulation. It decreases towards the edges of the TG, as can be seen in Figure 20 where the values for an AmBe source deployed at mid-height in each TG cell are presented (cf. Table I). Values differ by 2 % between the central point of the detector and the border regions. From these calibration points, a two-dimensional fit function is derived. The two lateral directions towards the short GC cells can be described by a Subbotin distribution (generalised normal distribution [55]) allowing for a plateau towards the centre of the detector. In Equation 8, µ denotes the mean of the distribution (set equal to zero in the fit), while the parameters σ and β allow scaling and shaping the distribution, respectively. The distribution is already given in a re-normalised form, which is used in the fit. The fitted distribution is represented in Figure 20 with a grey line. A similar effect is expected towards the long GC cells. Therefore, the obtained fit function from the X-axis is also applied to the Y -direction of the detector using an adapted plateau length. With this two-dimensional function, correction coefficients c Gd have been obtained per cell. Since the Stereo detector is not aligned with the antineutrino flux from the reactor core (cf. Figure 1), its orientation has to be taken into account in the computation of the correction coefficients. Weighting the two-dimensional function by the expected distribution of IBD vertices, the c Gd coefficients are found to be 0.9650 and 0.9846 for border and central cells, respectively. These results are slightly lower than results obtained with the AmBe calibration points since the border effect along the Y -axis is now taken into account. Two different types of uncertainties are applied to the c Gd values. On the one hand, a cell-to-cell uncorrelated uncertainty of 0.57 % has been obtained considering time fluctuations, source position bias, and residual discrepancies between model and measured values. On the other hand, a cell-to-cell correlated term has been estimated testing a different fit function for the AmBe efficiency ratios. More sensitive to the uncertainty of the slope of the fit function, cells 1 and 6 have the highest correlated uncertainty term (0.41 %), whereas the impact on central cells within the plateau region is limited to 0.13 % on average (cf. Figure 20).
The total correction coefficients are summarised in Table IV for each cell. Uncertainties are computed considering the previously mentioned systematic effects and cell-to-cell correlations. These correction coefficients contain both, the neutron capture fraction and the impact of the IBD selection cuts. They are taken into account in the oscillation analysis of the Stereo experiment (cf . Table II).

IX. BACKGROUNDS
A. Accidental Background Uncorrelated, i.e. random, combinations of events can pass the selection cuts described in Section VII. Their occurrence depends directly on the frequencies of single prompt and delayed events. The rate and spectrum of these accidental coincidences is evaluated statistically by an off-time method. Each valid prompt candidate is shifted in time by t 1 = 2 ms and the IBD selection is again applied. The time shift is chosen to be large compared to the neutron capture times in the liquids, such that any valid delayed event following the shifted prompt is not correlated to it. The formed pair is then registered in the accidental sample. This step is repeated with g max = 10 different times t g = g · t 1 , 1 < g ≤ g max in order to increase the statistical precision of the accidental background contribution. This method is applied along with the IBD search and monitors then accurately -for each cell and each energy bin -any change of this accidental background, caused by e.g. changes in the configuration of the neighbouring experiments. Variations in the accidental rate up to a factor 4 were recorded between reactor-on and reactor-off periods. We find the prompt spectrum of accidental coincidences to be dominated by low energy gamma events, and reflecting the single events spectrum (cf. Figure 21). Contributions above the highest gamma energy peak of natural radioactivity at 2.6 MeV from Tl mainly come from the cascades following neutron captures on Gd and high-energy γ-rays  from neutron capture on surrounding materials (Al, Fe) during reactor-on periods, due to neighbouring experiments. Correlated coincidences by e.g. boron decays (cf. Section VI E) impose a negligible background. The accuracy of the off-time method was validated by performing an IBD event selection with a long delay time of 2 ms (cut #3, Section VII). Looking at the spectrum of delay times above 1 ms, the study shows that the estimation of the accidental background rate exhibits subpercent bias. Any potential residual bias propagates to the signal extraction, and is maximal for high accidental rates and large differences between reactor-on and reactor-off. In the most unfavourable conditions -encountered in the lowest energy bins -this systematic on the rate of neutrino events is estimated to be less than 0.5 %, and is negligible in the higher energy bins.

B. Model of Correlated Background
Correlated background refers to any physically correlated process similar to an IBD signature. While the accidental background can be estimated along the data flow irrespective of the reactor status, this correlated back-  FIG. 23. Relative rates as a function of the atmospheric pressure. A compatible relative dependence for electronic (green) and proton (red) recoils is found. The set of electronic and proton recoils in this plot is built from all events having a Q tail /Qtot < µγ + 2σγ and Q tail /Qtot > µγ + 2.5σγ, respectively (cf. Figure 22). ground contribution is characterised from the periods where the reactor is turned off. It is therefore mandatory that the combination of the shielding of the detector and the selection cuts suppresses all correlated background connected to the reactor activity. A stringent upper limit of this kind of background is presented in Section IX C. We discuss here the background induced by the cosmic rays, virtually the single source of correlated background.
The residual correlated background after the IBD selection can be further classified as electronic recoils (weakly ionising particles, expected for the IBD signal) and proton recoil background (strongly ionising particles). The latter is the dominant one and arises from fast neutrons interacting in the liquid, inducing a prompt nuclear recoil signal followed by a valid delayed signal from their capture. Their rejection is based on the Pulse Shape Discrimination (PSD) capability of the liquid, whose typical distribution for reactor-off correlated IBD candidates of one energy bin and one cell is shown in Figure 22. The population at low Q tail /Q tot is made up from IBD events, correlated electronic background induced by cosmic rays, and accidental coincidences. The corresponding particles produce electronic recoils in the scintillator, leading predominantly to excitations of π-electrons of the scintillator in the singlet regime. This leads to fast fluorescence pulses. The second population at larger Q tail /Q tot is dominantly caused by muon-induced fast neutrons. Those neutrons produce proton recoils leading to excitations in the triplet regime. This leads to longer de-excitation times. The delay between the start of the integration time window of the entire scintillation pulse (resulting in Q tot ) and the start of the integration time window of the tail of the scintillation pulse (resulting in Q tail ) was chosen such that the best separation of the two   24. (Top) Comparison of pulse shape distributions for the reactor-off dataset of phase-II split into two bins of high and low atmospheric pressure (this figure) and for the two observed values of filling level of the reactor water pool (Figure 25). In both figures, the OFF1 spectra are normalised to their acquisition times, while the OFF2 spectra are scaled by a parameter a. These parameters are determined from fits to the OFF1 spectra allowing only for scaling of the OFF2 spectra. (Bottom) Ratio between the OFF1 and the scaled OFF2 spectra. In both figures, the spectra are compatible in shape within uncertainties.
populations in terms of Q tail /Q tot was achieved.
It has been found that the PSD is linearly anticorrelated to the liquid temperature via changes in its density [56]. The temperature variation can reach 3 • C on a two months scale (typical reactor-on period). To correct from these drifts, the observable Q tail /Q tot is monitored along time for each energy bin and each cell using the high-statistic sample of single events triggering the detector. Such a reference distribution is shown on the bottom plot of Figure 22. Dominated by electronic recoils, this population is fitted by a normal distribution, where the extracted parameters µ γ and σ γ denote the mean and the standard deviation of the electronic recoil population. The single events thus provide a precise monitoring of the temperature dependence. The correction of a global offset of the PSD distributions is implemented on a daily basis. Electronic recoils are then defined with respect to the singles reference as Q tail /Q tot < µ γ + 2σ γ and proton recoils as Q tail /Q tot > µ γ + 2.5σ γ (cf.   Figure 24 for details. The difference in the total number of days between Figure 24 and this figure arises from the exclusion of 9 days with changing water level in the channel above the detector. ure 22). Note that this separation is only used for investigations of the background, but not for the extraction of the neutrino signal presented in Section X.
The prompt background spectrum contaminating the electronic recoil component can be obtained by integrating the rate of events for Q tail /Q tot < µ γ + 2σ γ in each energy bin. The resulting distribution is presented in Figure 21. Series of neutron captures in the liquid surviving the isolation requirement (cut #10) are responsible for the purely electromagnetic contamination peaked at 2.2 MeV. They are associated to neutron captures by hydrogen. The prominent feature around 5.4 MeV arises from the 12 C(n,n'γ) 12 C reaction in the liquid. Unlike the previous one, this contribution is not purely electromagnetic because of the presence of a subsequent neutron recoil along with the 4.4 MeV gamma in the prompt signal. Combining these background spectra with the shape of the detected IBD spectrum, the signal-to-background ratios for the different contributions are illustrated in  The correlated background was found to be sensitive to environmental conditions, as expected because of its cosmic-induced origin. As shown in Figure 23, rates of electronic recoil and proton recoil events are correlated to the atmospheric pressure. The linear correlation coefficient is found to be compatible between the two populations. A dependence on the filling level of the water pool above the reactor core was also observed, but the greatest effect (due to a 7 m change in the water level) is 16 times weaker than due to a 30 hPa atmospheric pressure variation.
To study the influence of these changes on the PSD distribution, the reactor-off dataset is subdivided into subsets of data, classified by atmospheric pressure and filling level of the water pool above the reactor core. The shape of the PSD distributions remains very stable in all conditions of atmospheric pressure (cf. Figure 24) and water level (cf. Figure 25). The fit of a single normalisation factor a is enough to accurately overlap the distributions of two subsets of data with opposite variations of the environmental parameters.
Looking at the PSD distribution of all events of an AmBe source, we find a small shift in Q tail /Q tot relative to the distribution of single events over time. For our dataset, the shift can be modelled by a linear function and it is about 2 % per year. Most of this shift is found to be cell-to-cell correlated and it is therefore not affecting the oscillation analysis, which is insensitive to common shifts among detector cells. Exploiting the interleaved sequence of the reactor-on and reactor-off periods in the dataset (cf. Figure 3), we estimate a remaining 0.3 % cellto-cell uncorrelated uncertainty on the normalisation of neutrino rates. Compared to other uncertainties of this type, the effect does not contribute significantly (cf. Table II). We assume it to be covered by the other, already conservative, uncertainties.
The best-fit value of the normalisation factor a between reactor-on and reactor-off periods (analogous to the comparisons of different reactor-off datasets in Figures 24 and 25) can serve as a probe of potential reactorinduced background. Such a correlated background could be induced by fast neutrons producing a prompt proton recoil and a delayed capture signal. If sizeable, it would prevent the use of the background PSD spectra measured during reactor-off periods as models of background shape for the reactor-on periods. From the sensitivity to atmospheric pressure shown in Figure 23 and the equivalent measurement for the water level in the pool, it is possible to project all reactor datasets to the same environmental parameters. Once these corrections are applied, the relative difference of proton recoil rates between reactor-off and reactor-on phases is presented in Figure 26. Data points at high energy are compatible with zero while a small excess is observable towards low energy. In the worst case, the excess exists only for proton recoils and it would directly propagate into an underestimation of the Relative difference of proton recoil rates, corrected for the measured dependence on the atmospheric pressure and water level, between reactor-on and reactor-off periods. To avoid a contribution of IBDs, only events with Q tail /Qtot > µγ + 3.0σγ are considered (cf. Figure 22). The grey line represents the fit function as detailed in Equation 10.
IBD rates in the first energy bins since the antineutrino extraction method assume a constant shape of the PSD distribution (cf. Section X). In the best case, the excess is the same for proton and electronic recoils and it would not affect the IBD rates. We consider fast neutrons from the reactor or a neighbouring instrument (cf. Figure 1) as the most likely origin of a hypothetical additional background, producing prompt proton recoils at low energy. Allowing for an offset to account for a possible residual bias in the projection to the same environmental parameters, the excess is fitted by a power law where E is the reconstructed energy and u, v, and w are free fit parameters. Based on the most likely scenario, we correct for the full bias taking into account the energydependent signal-to-background ratios (cf. Figure 21). The correction is 3.6 % and 1.8 % in the first two energy bins, respectively, and is negligible in higher bins. We get the same results when fitting the excess with an exponential function. However, if the excess should be the same for proton and electronic recoils as expected in other scenarios, the IBD rates would not be affected. To cover this case, an uncertainty equal to the size of the correction is conservatively added in each bin as additional systematic uncertainty on the IBD rates. It is small compared to the statistical uncertainty.
Stable in shape over time, the correlated PSD distributions of IBD candidates obtained from the reactor-off dataset provide a model of the background that can be directly used for the extraction of the antineutrino signal.

X. ANTINEUTRINO SIGNAL EXTRACTION
Based on the stability of the offset-corrected PSD spectra established in Section IX, the number of IBD candidates is extracted for every cell l and energy bin i from the PSD distributions measured in the reactor-on and reactor-off phases. The PSD distributions of IBD candidates and accidental coincidences measured with reactor-off (reactor-on) are denoted as OFF l,i (ON l,i ) and OFF acc l,i (ON acc l,i ), respectively. The PSD distribution of correlated background pairs (without accidental coincidences) is modelled by m cor,OFF l,i for reactor-off periods. For reactor-on periods, the same model is used, only allowing for a normalisation factor a l,i which accounts for different measuring times (including different dead-time corrections), mean atmospheric pressures and reactor pool levels. In contrast, specific models m acc,OFF l,i and m acc,ON l,i are used to describe the accidental coincidences because of the different single spectra. The PSD distribution of IBD events is described by a scaled normal distribution G ν (A l,i , µ l,i , σ 2 l,i ) with mean µ l,i and standard deviation σ l,i . The scaling factor A l,i directly measures the IBD rate. Separately for each cell l and energy bin i, the four measured spectra are fitted simultaneously for all PSD bins p where ON l,i;p and OFF l,i;p are nonzero (cf. Figure 27 ON acc l,i;p = m acc,ON OFF acc l,i;p = m acc,OFF l,i;p (14) where f acc,ON and f acc,OFF (both ∼1/10) account for the number of time shifts used in the accidentals extraction as well as for the different suppression-times of accidentals in the coincidence window compared to the off-time windows used to measure the accidentals. They are different because accidental signals can be suppressed by correlated events in the coincidence window, but not in the off-time windows. The fit parameters a l,i , A l,i , µ l,i , σ l,i , and m ζ l,i;p , ζ ∈ {corr,OFF; acc,ON; acc,OFF} are free and are determined independently for the two Stereo phases. Because of the low-statistics regime in most of the PSD bins, a binned log-likelihood maximisation is used [56]. The number of IBDs A l,i is extracted for each cell l and 500 keV energy bin i separately. Compared to the PSD fit used in phase-I [11], the new method does not assume a particular analytical shape of the PSD distribution for background events. The description of the electronic and proton recoil component by single normal distributions turned out to be insufficient at higher statistics [57]. Stereo phase-I data were re-analysed using the new method. Because of the smaller statistics, in particular in the background sample, the standard deviations σ l,i of the IBD PSD distributions in phase-I were constrained for each cell to F l σ T,i ± √ 6 σ σ T ,i implemented as pull term. Here, σ T,i is the standard deviation of the IBD PSD distribution in the TG, σ σ T ,i its statistical uncertainty and F l a factor that accounts for the different PSD resolution of cell 4 (F 4 = 1.25) compared to the other cells (F l = 0.95, l = 4) in phase-I.
It is known that in the low statistics regime the estimator of the rate of IBDs can be biased [58]. The biases of each energy bin and cell are studied from simulations reproducing the fitting procedure and including all relevant experimental conditions (IBD rates, signalover-background ratios, background shape, binning size etc.). For one pseudo-experiment, the bias is defined as the difference between the number of fitted antineutrinos and the number of generated antineutrinos, relative to Values are shown per cell and per energy bin, as well as for the total spectrum ("Target"). The estimates of the IBD rates are corrected from this bias.
the number of generated antineutrinos. Examining over 5000 pseudo-experiments, this quantity is distributed following a normal law from which the mean position is extracted and reported in Figure 28. Below 6.5 MeV, the magnitude lies below 1 %, while it reaches up to 2 % in the last energy bin, where the ratio between signal and background events becomes smaller. For phase-I, with its lower statistics, the biases were typically below 2 % but reached up to 6 % in the last energy bin. The bias is corrected during the extraction of IBD candidates. In the case of the total spectrum, the PSD distributions of the six cells are summed before the fit leading to higher statistics and therefore a smaller bias, visible in Figure 28 as black triangles.

XI. OSCILLATION ANALYSIS
The oscillation analysis is performed in 11 equidistant energy bins of 500 keV width between 1.625 MeV and 7.125 MeV. The estimated number of measured IBD events for each of the six cells, extracted as described in Sections VII and X, is compared with their respective expectations allowing for sterile neutrino oscillations. The main effect on the number of IBDs in each cell is given by the dependence of the antineutrino flux on the distance L from the reactor core. Figure 29 shows a very good agreement with a quadratic behaviour when plotting the measured numbers of IBD events from all energy bins of cell l over the mean cell baseline L l .
To probe the signal of sterile neutrino oscillations, a relative comparison between cell spectra is performed via  29. Comparison of the measured number of IBD events N ν,l in each cell l (black) with a fit of these data points using a 1/L 2 model including a free normalisation (red). For each cell, the mean cell baseline L l is used. It is calculated by taking into account the solid angle of incident antineutrinos. The data points have been corrected with the detection efficiency det,l for each cell.
the following ∆χ 2 formalism: Since we do not want to rely on absolute rate predictions in our analysis, the φ i are introduced as free normalisation parameters for each energy bin i. They effectively adjust the number of expected IBDs M l,i across all cells l to match the number of measured IBDs D l,i on average. Since the φ i absorb all absolute rate information per energy bin i, the analysis becomes independent of the spectrum prediction. Simultaneously with the φ i , the M l,i are optimised in terms of oscillation parameters [sin 2 (2θ ee ), ∆m 2 41 ] and nuisance parameters α to match = M l,i sin 2 (2θ ee ), ∆m 2 41 · 1 + α NormU l +S Escale The dependence of M l,i on the oscillation parameters is given by Equation 1. This equation is applied individually to each MC event before it is registered in its corresponding bin M l,i . The parameter α EscaleC and α EscaleU l account for the cell-to-cell correlated and uncorrelated energy scale uncertainties, respectively, while the additional nuisance parameter α NormU l accounts for the cell-to-cell uncorrelated normalisation uncertainties (cf. Table II). All nuisance parameters are constrained in Equation 15 by their corresponding uncertainties via pull terms. Cell-to-cell correlated normalisation uncertainties do not need to be included since the free normalisation parameters φ i compensate any cell-to-cell correlated normalisation effect. α EscaleC and α EscaleU  Equation 15) for the best-fit to the no-oscillation model in phase-I+II. All parameters are contained within ± 1 standard deviation with respect to their corresponding pull terms.
by the free normalisation parameters φ i .
The statistical uncertainty in Equation 15 is given as a function of the expected IBDs φ i M l,i . This is due to the fact that the scan of the parameter space of neutrino oscillations is a particular case in which the model deviates significantly from the experimental values when testing very large mixing angles. For this reason, σ l,i cannot be approximated by the statistical uncertainty of the data anymore and must evolve with the rate predicted by the model. The same simulation as the one developed for the study of likelihood biases (Section X) is used for this purpose. It estimates the statistical uncertainty of the IBD signal as a function of the expected rate, under real conditions of background rate and spectral shape. This subtlety explains why the nuisance parameter α EscaleC , describing cell-to-cell correlated shifts in the energy scale, is kept in the expression of the χ 2 (cf. Equations 15 and 16) although, in principle, it is insensitive to correlated uncertainties. Here, the oscillation parameters induce different spectral distortions in each cell. Hence, α EscaleC , although common to all cells, propagates differently in the expected rates M l,i and associated statistical uncertainties σ l,i .
The no-oscillation scenario was tested by generating  pseudo-data are generated as fluctuations around the expected non-oscillated values within their uncertainty. Then, the value of ∆χ 2 sin 2 (2θ ee ), ∆m 2 41 = (18) χ 2 sin 2 (2θ ee ), ∆m 2 41 ,ˆ α − χ 2 sin 2 (2θ ee ), ∆m 2 41 , α is calculated, where all parameters are allowed to vary with exception of sin 2 (2θ ee ) and ∆m 2 41 . They are the parameters of the model to be tested, e.g. they are equal to zero in the case of the no-oscillation hypothesis.
As explained in Section III, the performance of the Stereo detector has changed between phase-I and phase-II. Therefore, we treat phase-I and phase-II as two separate experiments. In Equation 18, we use the same χ 2 formula (Equation 15) for each one separately, but minimising the sum of the two at the same time: +χ 2 PII sin 2 (2θ ee ), ∆m 2 41 , α PII , φ i .
The changed response of the detector between the phases is taken into account in the simulation. As the systematic uncertainties evolved between the two phases, the nuisance parameters α are different. However, we keep the same normalisation parameters φ i as we expect the adjustments of the spectrum to be the same for the two phases. To be able to do so, we have to add a normalisation term A which is common to all cells. We implement it for phase-I, where the parameters φ i are replaced by  the expression φ i A in Equations 15 and 17, respectively. The normalisation parameter A is adjusted freely during the minimisation and takes into account the difference in statistics between the two datasets. The parameters of interest sin 2 (2θ ee ) and ∆m 2 41 are kept identical for the two phases.
To test the no-oscillation hypothesis, we compare the ∆χ 2 of the data from phase-I+II with the distribution obtained from pseudo-experiments. We get a p-value of 9 %, i.e. the no-oscillation hypothesis cannot be rejected. Figure 30 shows the corresponding best-fit spectra for phase-II data relative to the re-normalised no-oscillation model when simultaneously fitting data from phase-I+II. We report the best-fit values of all nuisance parameters in Figure 31. They are all contained within one standard deviation of their central values.
While computing the ∆χ 2 values of the pseudoexperiments, we find that the fitting procedure tends to describe the generated pseudo-data using high values of sin 2 (2θ ee ) and ∆m 2 41 , which are located in a parameter space region already excluded by the global analysis [14]. We explain this by an overfitting of the statistical fluctuations present in the data. By going to high values of sin 2 (2θ ee ) and ∆m 2 41 , the fit model is able to match individual fluctuations. When introducing a constraint on the normalisation of the overall spectrum, the fit model no longer employs high values of sin 2 (2θ ee ) and ∆m 2 41 to describe the pseudo-data. For this study, we introduced a pull term similar to the ones in Equation 15, where the constrained parameter is the sum of the free normalisation parameters φ i across all cells (cf. Equation 15). Instead of constraining this sum to the integral of the predicted IBD spectrum presented in Section IV, we constrain this sum, for each pseudo-experiment, to the sum of the free normalisation parameters calculated for the pseudo-experiment under the assumption of no oscillations. In this study, we only impose a mild constraint of 20 % relative uncertainty with respect to the reference. This constraint is chosen much looser than it would be possible from the uncertainty of the absolute rate. Note that we utilised the normalisation constraint only for this study of overfitting, but not in the determination of any oscillation result presented in this article.
To draw exclusion contours, a similar approach to Equation 18 is performed for each parameter space point to be probed. However, due to the computationally intensive nature of the approach, the parameter space is scanned in a raster of several fixed ∆m 2 41 values. By plotting the rejected intervals of sin 2 (2θ ee ) for each ∆m 2 41 value along the ordinate, we achieve the exclusion region depicted in Figure 32. In particular, we reject the best-fit point of the RAA at more than 99.9 % C.L. for its ∆m 2 41 value.
The oscillatory structure of the exclusion contour (red) around the expected mean sensitivity (blue) in Figure 32 can be interpreted as statistical fluctuation. By replacing the real dataset with several pseudo-datasets, we are able to generate pseudo-exclusion contours which show a similar oscillatory structure around the expected mean sensitivity, but whose extrema are located at different values of ∆m 2 41 . A study of the sensitivity expected at the end of the Stereo experiment has been realised using the same method. To predict the sensitivity, we use the statistical uncertainties, calculated as described before, for the number of days expected at the end of the experiment. The expected number of days with reactor-on (off), normalised to the nominal reactor power, is 300 (500) days. The final expected sensitivity is shown in Figure 33. This sensitivity is calculated assuming the systematic uncertainties derived from the data acquired until now. It may improve in the future if systematic uncertainties further reduce. We expect the statistical uncertainties to be comparable to the systematic uncertainties in the final dataset of the Stereo experiment.
All results of the approach given by Equation 15 have been confirmed by an independent ∆χ 2 method, where all uncertainties are modelled by a covariance matrix instead of nuisance parameters [60].
The exclusion and mean sensitivity contours along with maps of ∆χ 2 and critical ∆χ 2 values at different confidence levels are available as supplementary material in reference [59]. When combining with other experimental data, special care should be exercised. The assumption of a standard χ 2 law instead of the provided critical ∆χ 2 values derived from non-standard χ 2 distributions leads to slightly modified contours. In addition, the contours were derived using a raster-scan in several fixed values of ∆m 2 41 . While this method is particularly suited to derive exclusion contours, it cannot be used to calculate allowed confidence regions for ∆m 2 41 and consequently two-dimensional allowed confidence regions. This is because ∆χ 2 values are not reflecting the likelihood of individual ∆m 2 41 values. Thus, a direct comparison of ∆χ 2 values across different ∆m 2 41 values is not possible in a statistically meaningful way.

XII. CONCLUSION
After the initial results of phase-I of the Stereo experiment [11] based on 66 days of reactor-on data, the dataset was extended to include in total 179 days of reactor-on and 235 days of reactor-off data in phase-I+II. The separation walls of the detector were largely improved during the long reactor shutdown between the two phases, and the internal calibration system of the detector enhanced. In addition, several aspects of the data analysis and evaluation of systematic uncertainties were improved. The optical model was updated for phase-II and the optical properties of various detector components were refined. Here, special emphasis was given to the modelling of separation walls and the light cross-talk between the cells of the TG volume. The accuracy of the simulation of the reactor flux and energy spectrum was improved by utilising more precise calculations of corrections to the Huber model. Special care was given to the dominant contribution from antineutrinos originating from β-decays of 28 Al and 56 Mn across the entire reactor heavy water tank. Due to an improved optical model in the detector simulation as well as the exploitation of the energy spectrum of 12 B events, the energy scale of the experiment underwent a reduction of uncertainty across the entire energy range and TG volume. The selection uncertainties of IBD events were studied in great detail and corrections to the MC model were derived. In particular, a three-dimensional correction map for the neutron selection efficiency in the entire TG was determined, correcting for an imperfect description of the neutron mobility in the simulations. In addition, a more precise simulation of the gamma-cascades after neutron captures on Gd exploiting the new FIFRELIN simulation package was developed. Systematic uncertainties in the background description were reduced by fitting in-situ measured PSD distributions of background acquired in reactor-off and reactor-on phases. Various systematic effects in this method were investigated and are accounted for accordingly. The statistical oscillation analysis of the resulting IBD spectra was performed by a predictionindependent procedure with free normalisation factors which is insusceptible to cell-to-cell correlated systematics. We exclude the best-fit point of the RAA at more than 99.9 % C.L. by combining our phase-I and phase-II datasets on a statistical basis using the aforementioned method. For our expected final dataset, we estimate our final 95 % C.L. exclusion sensitivity to fully cover the relevant ∆m 2 range of the 95 % allowed region of the RAA.

DATA AVAILABILITY
To allow inclusion of this work into global oscillation analyses and further works, we provide various results of our analysis in digitised form. These supplements can be found in reference [59].