Measurement of the B 0 s → μ + μ − decay properties and search for the B 0 → μ + μ − and B 0 s → μ + μ − γ decays

An improved measurement of the decay B 0 s → μ þ μ − and searches for the decays B 0 → μ þ μ − and B 0 s → μ þ μ − γ are performed at the LHCb experiment using data collected in proton-proton collisions at ﬃﬃﬃ s p ¼ 7 , 8 and 13 TeV, corresponding to integrated luminosities of 1, 2 and 6 fb − 1 , respectively. The B 0 s → μ þ μ − branching fraction and effective lifetime are measured to be B ð B 0 s → μ þ μ − Þ ¼ ð 3 . 09 þ 0 . 46 þ 0 . 15 − 0 . 43 − 0 . 11 Þ × 10 − 9 and τ ð B 0 s → μ þ μ − Þ ¼ ð 2 . 07 (cid:2) 0 . 29 (cid:2) 0 . 03 Þ ps, respectively, where the uncertainties include both statistical and systematic contributions. No significant signal for B 0 → μ þ μ − and B 0 s → μ þ μ − γ decays is found and the upper limits B ð B 0 → μ þ μ − Þ < 2 . 6 × 10 − 10 and B ð B 0 s → μ þ μ − γ Þ < 2 . 0 × 10 − 9 at 95% confidence level are determined, where the latter is limited to the range m μμ > 4 . 9 GeV =c 2 . Additionally, the ratio between the B 0 → μ þ μ − and B 0 s → μ þ μ − branching fractions is measured to be R μ þ μ − < 0 . 095 at 95% confidence level. The results are in agreement with the Standard Model predictions. DOI: 10.1103/PhysRevD.105.012010


I. INTRODUCTION
Decays mediated by a quark flavor-changing neutral interaction are not allowed at tree level in the Standard Model (SM) of particle physics but can proceed through quantum loops, making them rare processes. The leptonic B 0 → μ þ μ − and B 0 s → μ þ μ − decays (the inclusion of charge-conjugated processes is implied throughout this paper) are even rarer because they are additionally helicity-suppressed. As they are characterized by a purely leptonic final state, and thanks to the progress in lattice QCD calculations [1][2][3][4][5], their time-integrated branching fractions are predicted in the SM with small uncertainties to be BðB 0 s → μ þ μ − Þ ¼ ð3.66 AE 0.14Þ × 10 −9 and BðB 0 → μ þ μ − Þ ¼ ð1.03 AE 0.05Þ × 10 −10 [6,7]. This makes these processes powerful probes for detecting deviations from the SM due to new physics (NP) contributions mediated, for instance, by heavy Z 0 gauge bosons, leptoquarks or non-SM Higgs bosons (see e.g., Ref. [8]).
An effective field theory description of b → sμ þ μ − transitions makes it possible to tightly constrain the currents contributing to their amplitudes in a modelindependent way. In this framework, the branching fractions of B 0 → μ þ μ − and B 0 s → μ þ μ − decays are sensitive to axial-vector, scalar and pseudoscalar operators and their chirality-flipped counterparts [9,10]. Of these, only the lefthanded axial-vector current is present in the SM at a significant level. Examples of SM Feynman diagrams contributing to the B 0 s → μ þ μ − amplitude are shown in Figs. 1(a) and 1(b).
Given the low rate of the B 0 → μ þ μ − and B 0 s → μ þ μ − decays, their branching fractions are measured without distinguishing between B 0 ðsÞ andB 0 ðsÞ at production. Both B 0 and B 0 s mesons oscillate into their antiparticles but, in contrast to the B 0 system, the light and heavy mass eigenstates of B 0 s mesons are characterized by a sizeable difference between their decay widths, ΔΓ s ¼ 0.085 AE 0.004 ps −1 [11], and thus have different lifetimes. This gives rise to the relation [12] BðB 0 between the flavor-untagged and time-integrated branching fraction and the value at decay time t ¼ 0. The theoretical prediction mentioned at the beginning of this section includes this correction. In Eq. (1), y s ≡ ΔΓ s =ð2Γ s Þ ¼ 0.065 AE 0.003 [11] and the parameter A μμ ΔΓ s is defined as A μμ ΔΓ s ≡ −2RðλÞ=ð1 þ jλj 2 Þ, with λ¼ðq=pÞðAðB 0 s →μ þ μ − Þ= AðB 0 s →μ þ μ − ÞÞ. The complex coefficients p and q relate the mass and CP eigenstates of the B 0 s −B 0 s system with the flavor eigenstates B 0 s andB 0 s (see, e.g., Ref. [11]), and A is the amplitude of the process. In the SM, only the CP-odd eigenstate (which, except for small modifications from CP violation, corresponds to the heavy mass eigenstate) decays to μ þ μ − and the quantity A μμ ΔΓ s is equal to unity. However, in the presence of NP contributions it can assume any value in the range −1 ≤ A μμ ΔΓ s ≤ 1 [12]. Thus the B 0 s → μ þ μ − branching fraction might differ from the SM prediction in either of the two factors in the right-hand side of Eq. (1).
The B 0 s → μ þ μ − effective lifetime is defined as where t is the decay time of the B 0 s orB 0 s meson, the decaytime distribution ΓðB s ðtÞ → μ þ μ − Þ for B 0 s → μ þ μ − andB 0 s → μ þ μ − decays with and without oscillations is defined as ΓðB s ðtÞ → μ þ μ − Þ ≡ ΓðB 0 s ðtÞ → μ þ μ − Þþ ΓðB 0 s ðtÞ → μ þ μ − Þ, and τ B 0 s ¼ 1.515 AE 0.004 ps [11] is the mean B 0 s lifetime. By measuring the B 0 s → μ þ μ − effective lifetime, the contribution of each mass eigenstate, and thus the CP structure of the interaction involved in the decay, can be inferred, and a direct evaluation of A μμ ΔΓ s can be performed. The lifetime thus makes it possible to discriminate between contributions from scalar or pseudoscalar interactions in a complementary way to the branching ratio. Similar effects are not significant for B 0 → μ þ μ − decays due to the negligible decay width difference of the B 0 mass eigenstates.
The ratio of the B 0 → μ þ μ − and B 0 s → μ þ μ − branching fractions also provides powerful discrimination between NP theories [13]. This quantity is theoretically more precise than the two individual branching fractions due to the cancellation of common theoretical uncertainties. It can be obtained as where τ B 0 is the lifetime of the B 0 , Γ s H is the width of the heavy-mass eigenstate of the B 0 s meson, M B 0 and M B 0 s are the masses of the B 0 ðsÞ mesons, f B 0 and f B 0 s are the B 0 ðsÞ meson decay constants, V td and V ts are Cabibbo-Kobayashi-Maskawa (CKM) matrix elements and m μ is the mass of the muon. In the SM, R μ þ μ − is predicted to be 0.0281 AE 0.0016 [7] and it assumes the same value in all NP models with the same flavor structure as the SM [14]. The B 0 s → μ þ μ − γ decay is also rare in the SM. Compared to the B 0 s → μ þ μ − amplitude, the additional suppression arising from the photon is compensated by the fact that the amplitude is no longer helicity-suppressed, increasing the total predicted branching fraction to Oð10 −8 Þ [15][16][17][18][19][20][21]. Two groups of amplitudes contribute to this decay: those where the photon is emitted from the initial state (initialstate radiation or ISR), an example of which is shown in Fig. 1(c), and those in which it is emitted from the final state (final-state radiation, FSR), as in Fig. 1(d). Their interference is evaluated to be negligible due to their combined helicity and kinematic suppression [18,19,22]. The FSR contribution to the B 0 s → μ þ μ − γ process is experimentally included in the B 0 s → μ þ μ − decay through the description of the radiative tail in its mass distribution. The ISR component is sensitive to a wider range of interactions and is treated as a separate contribution to the mass fit. In the mass region of interest its contribution decreases as the mass increases, becoming null for values larger than the B 0 s mass. Similar to other multibody b → sll decays, the sensitivity to different interactions depends on the dimuon mass squared, q 2 , of the decay. At low q 2 , the decay is mostly sensitive to tensor and pseudotensor interactions, while at high q 2 vector and axial-vector contributions dominate [23,24]. This makes the ISR B 0 s → μ þ μ − γ decay at high q 2 an ideal place to probe the same interactions that drive the anomalies seen in some b → sll decays [25][26][27][28]. In the rest of this paper, B 0 s → μ þ μ − γ refers only to the ISR process.
This paper presents improved measurements of the B 0 s → μ þ μ − time-integrated branching fraction and effective lifetime, as well as a search for the B 0 → μ þ μ − decay, superseding the results in Ref. [32]. Moreover, a first search for the B 0 s → μ þ μ − γ decay at high dimuon mass is also presented. Due to the large uncertainty on the form factors, no attempt is made to extrapolate the B 0 s → μ þ μ − γ result to the full dimuon range. These results, also reported in Ref. [37], are based on data collected with the LHCb detector, corresponding to an integrated luminosity of 1 fb −1 of pp collisions at a center-of-mass energy ffiffi ffi s p ¼ 7 TeV, 2 fb −1 at ffiffi ffi s p ¼ 8 TeV and 6 fb −1 at ffiffi ffi s p ¼ 13 TeV. The first two datasets are referred to as Run 1 and the latter as Run 2. Throughout this paper, s → μ þ μ − γ decays with the dimuon pair selected in the mass range ½4900; 6000 MeV=c 2 and the photon not reconstructed. The contribution from B 0 → μ þ μ − γ decays is considered negligible compared to that from B 0 s → μ þ μ − γ because of the additional CKM suppression and the mass shift to lower values.

II. ANALYSIS STRATEGY
The signature of B 0 ðsÞ → μ þ μ − decays in the LHCb detector consists of two oppositely charged muons with a dimuon mass in the B 0 s or B 0 mass region, and a decay vertex displaced with respect to any pp interaction vertex as a result of the significant average flight distance of the B mesons. The B 0 s → μ þ μ − γ channel is searched for with the same signature, without reconstructing the photon, as was proposed in Ref. [22].
The main background can be divided into two categories: combinatorial background arising from random combinations of muons from two distinct b-hadron decays in the same event, and physical background comprising b-hadron decays where one or more final state particles has either been misidentified as a muon or not reconstructed. Combinatorial background candidates are distributed across the entire search region from low to high mass, while physical background contributions tend to populate the region below the B 0 s mass. The dominant physical background sources are: B 0 ðsÞ → h þ h 0− decays where the hadrons h; h 0 ¼ K, π are misidentified as muons, which mainly contribute to the B 0 mass region; and partially reconstructed b-hadron decays, which populate the same lower dimuon mass region as the B 0 s → μ þ μ − γ signal. The most important partially reconstructed background sources are semileptonic Combinatorial background is separated from the B 0 ðsÞ → μ þ μ − signal by exploiting differences between their topologies and the relative isolation from other tracks in the event of the muons forming the B candidate. This information is combined in a multivariate classifier based on a boosted decision tree [38], the output response of which, BDT, is used to classify the events as described in Sec. IV. A stringent particle identification (PID) requirement is used to suppress physical background from B 0 ðsÞ → h þ h 0− and semileptonic decays, described in Sec. VII.
As presented in Sec. VIII, the signal yields are estimated using an extended unbinned maximum-likelihood fit to the dimuon mass distribution, which is performed simultaneously in intervals of the BDT response to increase the sensitivity of the measurement. The BDT and mass distributions of the signals are calibrated and validated using data, as detailed in Sec. V.
To measure the branching fractions, the yields of The measurement of the B 0 s → μ þ μ − effective lifetime uses a similar selection strategy, which is optimized to achieve the highest sensitivity. After the selection, a maximum-likelihood fit is performed to the dimuon mass distribution in two BDT regions to subtract the background. The decay-time acceptance in each BDT region is calibrated on corrected simulation samples and validated by applying the analysis procedure to B 0 ðsÞ → h þ h 0− candidates from data. Finally, the effective lifetime is extracted using a maximum-likelihood fit to the background-subtracted decay-time distribution, performed simultaneously across both BDT regions, as presented in Sec. IX.

III. DETECTOR AND SIMULATION
The LHCb detector [39,40] is a single-arm forward spectrometer covering the pseudorapidity range 2 < η < 5, designed for the study of particles containing b or c quarks. The detector includes a high-precision tracking system consisting of a silicon-strip vertex detector surrounding the pp interaction region, a large-area silicon-strip detector located upstream of a dipole magnet with a bending power of about 4 Tm, and three stations of silicon-strip detectors and straw drift tubes placed downstream of the magnet. The tracking system provides a measurement of the momentum, p, of charged particles with a relative uncertainty that varies from 0.5% at low momentum to 1.0% at 200 GeV=c. The minimum distance of a track to a primary pp collision vertex (PV), the impact parameter (IP), is measured with a resolution of ð15 þ 29=p T Þ μm, where p T is the component of the momentum transverse to the beam, in GeV=c. Different types of charged hadrons are distinguished using information from two ring-imaging Cherenkov detectors. Photons, electrons and hadrons are identified by a calorimeter system consisting of scintillating-pad and preshower detectors, an electromagnetic and a hadronic calorimeter. Muons are identified by a system composed of alternating layers of iron and multiwire proportional chambers. The online event selection is performed by a trigger, which consists of a hardware stage, based on information from the calorimeter and muon systems, followed by two software stages. The first software stage performs a preliminary event reconstruction using only part of the available event information, while the second stage performs a full event reconstruction.
Simulation is used to estimate the acceptance, reconstruction and selection efficiencies and to optimise the analysis strategy. The pp collisions are generated using PYTHIA [41] with a specific LHCb configuration [42]. Decays of particles are described by EVTGEN [43]. Decays of B þ c mesons are generated using the dedicated BCVEGPY generator [44,45]. Final-state radiation in the decay of particles is simulated using PHOTOS [46], which is observed to agree with a full quantum electrodynamics calculation at the level of 1% [47]. ISR B 0 s → μ þ μ − γ decays are simulated according to the study in Ref. [18]. The interaction of the generated particles with the detector, and its response, are implemented using the GEANT4 toolkit [48], as described in Ref. [49].

IV. SIGNAL SELECTION
In the online event selection, signal candidates are first required to pass the hardware trigger, which selects events with at least one muon with high transverse momentum, followed by a two-level software stage, which applies a full event reconstruction. The software stage imposes minimum requirements on the muon transverse momentum and impact parameter with respect to all PV. However, to maximize the signal selection efficiency, events triggered by particles not related to the signal candidates are also retained for further analysis. Events used in the branching fraction measurement may also be triggered by a combination of particles forming the signal candidate and particles from the rest of the event. However, events that are only triggered from such a combination of signal and background particles are excluded from the effective lifetime measurement in order to ensure accurate modeling of the trigger efficiency as a function of decay time in simulation.
Candidate B 0 ðsÞ → μ þ μ − decays are selected offline by combining two well-reconstructed oppositely charged particles identified as muons [50], with transverse momentum in the range 0.25 < p T < 40 GeV=c, and momentum p < 500 GeV=c. The muon candidates are required to form a secondary vertex (SV) with a vertex-fit χ 2 per degree of freedom smaller than 9 and separated from any PV with a significance greater than 15. Only muon candidate tracks with χ 2 IP > 25 for any PV are selected, where χ 2 IP is defined as the difference between the vertex-fit χ 2 of the PV formed with and without the particle in question.
The resulting B 0 ðsÞ candidates must have a decay time lower than 13.25 ps, χ 2 IP < 25 with respect to the PV for which the χ 2 IP is minimal (henceforth referred to as the PV associated with the B 0 ðsÞ candidate) and p T > 0.5 GeV=c. To suppress the B þ c → J=ψμ þ ν μ background, a B 0 ðsÞ candidate is rejected if either of the two candidate muons combined with any other oppositely charged muon candidate in the event has a mass within 30 MeV=c 2 of the J=ψ mass [51] (J=ψ veto). Further requirements on the particle identification (PID) information of the two muons are imposed in order to reject misidentified hadronic background. PID identification uses multivariate techniques to combine information from different subsystems taking correlations into account [40].
Candidate B 0 ðsÞ → μ þ μ − decays used in the branching fraction measurements are selected in the dimuon mass range 4900 ≤ mðμ þ μ − Þ ≤ 6000 MeV=c 2 , while those used in the lifetime measurement are selected in a narrower range, 5320 ≤ mðμ þ μ − Þ ≤ 6000 MeV=c 2 . The reduced mass range used in the lifetime measurement excludes most of the B 0 → μ þ μ − and physical background decays that populate the lower dimuon mass region, greatly simplifying the fit and making it possible to impose less stringent PID requirements used to reject misidentified background, thus increasing the signal selection efficiency. To avoid potential biases, the candidates in the mass region 5200 ≤ mðμ þ μ − Þ ≤ 5445 MeV=c 2 , where B 0 ðsÞ → μ þ μ − candidates peak, were not examined until the analysis procedure was finalized.
In addition to the signal channels, B 0 ðsÞ → h þ h 0− and B þ → J=ψK þ decays are selected as normalization and control channels, and B 0 s → J=ψϕ as control channel. Candidate B 0 ðsÞ → h þ h 0− decays are selected using the same requirements as the signal channels, except that the muon identification criteria are replaced with hadron identification, the events are triggered independently of the decay final state, and the J=ψ veto is not applied. Candidate B þ → J=ψK þ and B 0 s → J=ψϕ decays are formed by combining a muon pair, with mass close to the J=ψ mass [51], with one track (B þ → J=ψK þ ) or two oppositely charged tracks consistent with originating from a ϕ decay (B 0 s → J=ψϕ), with the kaon mass hypothesis assigned. All tracks forming B þ → J=ψK þ and B 0 s → J=ψϕ candidates are selected with the same requirements as those applied to select the B 0 ðsÞ → μ þ μ − candidates, except for the dimuon mass range and the particle identification criteria. The muons are only required to pass the muon system identification criteria [50] and no kaon identification criteria are required, given the already excellent signal purity achieved. The same trigger strategy as for the signal decays is used for these two channels.
Background events are further rejected using a loose requirement on the response of a boosted decision tree [38,52,53], which was first described in Ref. [54] and has remained unchanged. The classifier takes as input: the angle between the direction of the momentum of the B 0 ðsÞ candidate and the direction defined by the vector joining the primary and the secondary vertices; the B 0 ðsÞ candidate IP and its vertex χ 2 ; the minimum IP of the muons with respect to any PV; the minimum distance between the two muon tracks; the χ 2 of the SV. This classifier is also applied to the control and normalization channels, where for the B þ → J=ψK þ and B 0 s → J=ψϕ modes the χ 2 of the SV is replaced with that of the J=ψ vertex. The selected sample of B 0 ðsÞ → μ þ μ − candidates is dominated by random combinations of two muons (combinatorial background), mainly originating from semileptonic decays of two different b hadrons. The reconstruction and selection efficiencies for the signal and normalization modes are reported in Sec. VI C.
Isolation variables, which quantify the possibility that other tracks in the event originated from the same hadron decay as the signal muon candidates, are constructed in order to further reject background. Most combinatorial background candidates arise from semileptonic b-hadron decays, where other charged particles produced in the decay may be reconstructed close to the signal muon candidate. Two isolation variables are designed to recognize these particles, each considering a different type of track: one uses additional tracks that have been reconstructed both before and after the magnet (long tracks), while the other considers tracks reconstructed only in the vertex detector (VELO tracks). These isolation variables are defined based on the proximity of the two muons forming the B 0 ðsÞ candidate to other tracks in the event. The closeness of each muon candidate to either a long track or a VELO track is measured using two dedicated multivariate classifiers that take the following quantities as inputs: the minimum χ 2 IP of the track with respect to any PV; the signed distance between the muon-track vertex and the PV associated to the B 0 ðsÞ candidate; the signed distance between the muon-track vertex and the B 0 ðsÞ decay vertex; the distance of closest approach between the track and the muon; the angular separation between the track and the muon; a quantity that measures the compatibility of the muon-track system with having originated from the PV associated to the B 0 ðsÞ candidate. The long-track isolation classifier takes three additional variables as input: the absolute difference between the azimuthal angles of the track and the muon; the absolute difference between the pseudorapidities of the track and the muon; and the track p T . The classifiers are trained on collections of trackmuon pairs from simulated B 0 s → μ þ μ − decays and from simulated bb → μ þ μ − X events. The latter sample includes decays with two muons originating from two different bhadrons or from the same b-hadron. In both cases the muons can either originate directly from the hadron containing the b-quark or from intermediate resonances.
Only tracks originating from the same b-hadron decay as the muon candidate are considered to train the classifier with the simulated bb → μ þ μ − X events. The output value of the classifiers for a given track-muon pair is defined to be higher when the track is "closer" to the muon. Defining Iðμ AE Þ as the maximum value of a given classifier over all the track-muon pairs in the event, the long and VELO-track isolation variables are each defined as Iðμ þ Þ þ Iðμ − Þ.
Signal and background events are separated using a final boosted decision tree classifier that combines kinematic, topological and isolation information, defined as in the previous measurement [32]. The BDT response is used to divide the data into samples of varying signal purity, which are then fitted simultaneously as described in Secs. VIII and IX. The classifier is trained using simulated samples of B 0 s → μ þ μ − decays as signal and of inclusive bb → μ þ μ − X events as proxy for the combinatorial background. It combines information from the following input variables: ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi where Δϕ and Δη are the azimuthal angle and pseudorapidity differences between the two muon candidates; the minimum χ 2 IP of the two muons with respect to the B 0 ðsÞ associated PV; the angle between the B 0 ðsÞ candidate momentum and the vector joining the B 0 ðsÞ decay vertex and B 0 ðsÞ associated PV; the B 0 ðsÞ candidate vertex-fit χ 2 ; the B 0 ðsÞ impact parameter significance with respect to the B 0 ðsÞ associated PV; the long-and VELO-track isolation variables. The BDT classifier response is defined to have an approximately uniform distribution in the range 0 ≤ BDT ≤ 1 for signal, and to peak at zero for background. Its correlation with the dimuon mass is below 5%. The branching fraction measurement is performed by dividing the Run 1 and Run 2 data samples into six subsets each, based on regions in the BDT response with boundaries 0, 0.25, 0.4, 0.5, 0.6, 0.7 and 1. Figure 2 shows the expected BDT distribution for B 0 s → μ þ μ − decays, as determined in Sec. V B, and combinatorial background. The sample with 0 ≤ BDT < 0.25 is discarded as it is dominated by background. The data used in the measurement of the B 0 s → μ þ μ − effective lifetime are split into two regions in the BDT response with the ranges 0.35 ≤ BDT < 0.55 and 0.55 ≤ BDT ≤ 1, which are chosen to minimize the expected statistical uncertainty on the effective lifetime based on the results of pseudoexperiments.

V. SIGNAL CALIBRATION
The dimuon mass and the BDT classifier response are used to separate signal from background in the determination of the branching fractions and the B 0 s → μ þ μ − effective lifetime. It is therefore essential that these variables are accurately calibrated in order to account for possible discrepancies between data and simulation. The calibration procedures for these two variables are described in the following sections.

A. Mass shape calibration
The mass shape of the B 0 s → μ þ μ − and B 0 → μ þ μ − signals is described with a double-sided Crystal Ball (DSCB) function [55] fðmjμ; σ; α l ; n l ; α r ; n r Þ where m is the dimuon mass and all the parameters are positive. The function has a Gaussian core with mean μ and resolution σ and power-law tails on both sides defined by two starting points in units of σ, α l and α r , and two slopes, n l and n r for the left and right side, respectively. The mean of the B 0 s → μ þ μ − and B 0 → μ þ μ − signal peaks are calibrated with data samples containing B 0 s → K þ K − and B 0 → K þ π − decays, respectively. Besides the contamination from combinatorial background, the B 0 → K þ π − sample contains contributions from B 0 s → K − π þ decays and partially reconstructed background decays, Fig. 3, are modeled with a DSCB function. The difference between the B 0 and B 0 s mass values, taken from Refs. [51,[56][57][58], is used to constrain the B 0 s → K − π þ mean with respect to the B 0 → K þ π − mode. The mass resolutions for the two modes are constrained using the mass resolution calibration described below. The combinatorial background is modeled with an exponential shape with its slope parameter left to vary freely. The partially reconstructed background component is described by an ARGUS shape [59], while the component for Λ 0 b → ph − decays is modeled as the sum of two Crystal Ball (CB) functions [55], with all parameters, except the total yield, fixed from simulation. The results of the fits are shown in Fig. 3. To check the correlation between the PID selection and the mean of the B 0 ðsÞ → h þ h 0− signal peak, and to study the effect of possible contamination from misidentified background, the fit is repeated after tightening the PID requirements. The variation of the mean value is assigned as a systematic uncertainty. The mean is found to be uncorrelated with the BDT response.
The mass resolution is calibrated with data samples containing charmonium (J=ψ, ψð2SÞ) and bottomonium [ϒð1SÞ, ϒð2SÞ and ϒð3SÞ] resonances decaying into two muons, selected similarly to the signal. The natural widths of all these resonances are negligible compared to the mass resolution of the LHCb experiment. The resolution of each resonance is obtained from a mass fit to the data. The distributions of the dimuon mass, m μ þ μ − , of quarkonium decays, shown in Fig. 4, are modeled with a DSCB function. The combinatorial background is modeled with an exponential shape. In the bottomonium fits, the tails are constrained from simulation. A second-order Chebychev polynomial is used as alternative shape for the combinatorial background. The difference between the mass resolutions measured with the two background descriptions is taken as systematic uncertainty. The power-law function Þ a 2 is found to describe the mass resolution of simulated Drell-Yan events accurately. This function is fitted to the measured resolutions of the quarkonia, including their systematic uncertainties, and used to determine the mass resolution at the B 0 s and B 0 mass, as shown in Fig 5. The mass resolution in Run 2 is found to be slightly better than in Run 1, which is explained by improvements in the track reconstruction. The ϒð3SÞ resolution is larger than expected from the power-law function and thus an interpolation with a third-order polynomial was also performed. The differences of the interpolated B 0 and B 0 s mass resolutions with respect to their default widths are assigned as a systematic uncertainty.
The four tail parameters of the B 0 ðsÞ → μ þ μ − signal shape are in common between the B 0 and B 0 s decays. They are determined from the mass distributions in simulation, after convoluting them with a Gaussian function to match their core resolutions with the values found in data as described in the previous paragraph.
Each step of the mass calibration is performed separately for each data-taking year. Subsequently, average results for the mass shape parameters for Run 1 and      Table I. A small but significant dependence of the mass resolution with the BDT response is found, for which correction factors ranging from 0.97 to 1.03 are applied. For the lefthand tail parameters, BDT-dependent correction factors ranging from 0.9 to 1.2 are obtained.

B. BDT calibration
The BDT response is determined using simulated decays, to which corrections are applied to account for possible discrepancies between simulation and data. This calibration procedure is performed in three steps. First, the B -meson kinematics and the detector occupancy of the simulation are corrected using control channels in data. Second, the effect of the PID and trigger selections are evaluated on data and used to correct the BDT response. Finally, since the BDT response for real B 0 s → μ þ μ − decays is strongly correlated with their decay-time distribution and hence with the B 0 s → μ þ μ − effective lifetime, which is a priori unknown, an additional correction is applied under different effective lifetime hypotheses. Taking these three correction factors into account, the fraction of signal decays in each BDT region, f BDT;i , can be expressed as where f μμ sim;i is the fraction of events per BDT region in the corrected simulation, f μμ PID;i and f μμ trig;i are the weights used to correct for the PID and trigger selections, respectively, and the effective lifetime correction k i is used for B 0 s → μ þ μ − and B 0 s → μ þ μ − γ decays. No such correction is needed for B 0 → μ þ μ − decays due to the small width difference between the B 0 mass eigenstates.
The B meson kinematics of the simulated signals are corrected using a gradient boosting reweighter. This technique consists of training a boosted decision tree classifier to align two samples, in this case data and simulation, as described in Ref [60]. The transverse momentum p T , the pseudorapidity η, and the χ 2 IP of the B candidate are used as input variables for the gradient boosting reweighter, as these are the variables required to correct the simulation. The weights obtained from this procedure are applied to all simulation samples used for calibration and normalization. The kinematic distributions for B 0 and B 0 s mesons differ, as determined in hadronization fraction measurements [61], thus they are corrected with a sample of B þ → J=ψK þ and B 0 s → J=ψϕ decays, respectively. An additional correction to the BDT distribution shape stems from the event occupancy, measured as the number of tracks in the event. As this affects the muon track isolation variables, which are important inputs to the BDT classifier, the correction is determined in four intervals of the total number of reconstructed tracks. The correction weights are determined by comparing the relative number of B þ → J=ψK þ decays in background-subtracted data and simulated samples in these intervals. It is ensured that the input variable distributions of B þ → J=ψK þ candidates match those of B 0 ðsÞ → μ þ μ − candidates as closely as possible by evaluating the BDT response based on the final state muons and the B þ candidate, with two exceptions: the final state kaon is excluded from the calculation of the isolation variables, and the decay vertex χ 2 is determined on the J=ψ candidate, to match the number of degrees of freedom of the signal.
The PID efficiency correction per BDT region for B 0 ðsÞ → μ þ μ − decays, f μμ PID;i , is determined on dedicated calibration samples and convolved with the muon kinematics of simulated signal per region, as described in Sec. VI C. As the total PID efficiency is part of the normalization (Sec. VI), the PID efficiency correction of the BDT response is determined as the relative PID efficiency per BDT region; no uncertainty is assigned on this correction, as it is already included in the total efficiency.
A similar procedure is adopted for the trigger efficiency per BDT regions, f μμ trig;i . The trigger efficiencies are determined on data samples containing B þ → J=ψK þ decays with the trigger calibration method (see Sec. VI C) [62]. They are calculated in ranges of the maximum p T of the two muons and the product of the p T of the two muons, which are the variables used for the muon and dimuon hardware trigger. The trigger efficiency per BDT region is determined by the convolution of the obtained efficiencies with the kinematics of simulated signal. The details of this method, also employed for the full efficiency determination, are given in Sec. VI C.
In simulation, B 0 s → μ þ μ − ðγÞ decays are generated using the mean B 0 s lifetime, while the effective lifetime can have any value between the lifetime of the light and the heavy mass eigenstates. As the BDT classifier is correlated with the B 0 s candidate decay time, an additional correction is included for the B 0 s → μ þ μ − ðγÞ BDT response distributions. The correction is evaluated for A μμ ΔΓ s ¼ −1, 0 and 1, corresponding to τ μ þ μ − ¼ 1.423, 1.527 and 1.620 ps, and covering the full physically allowed range. Simulated candidates selected with the procedure described in Sec. IV are weighted according to where t j is the reconstructed decay time of the candidate j, τ gen is the lifetime used for generation, and τ μ þ μ − is the effective lifetime calculated from y s , τ B 0 s and A μμ ΔΓ s . A correction factor, k i , is then calculated for each BDT region according to where N i is the number of signal decays in each BDT region. The calibrated BDT response for B 0 s → μ þ μ − decays under the SM hypothesis A μμ ΔΓ s ¼ 1 and for B 0 → μ þ μ − decays are shown in Fig. 6. The main systematic uncertainties on the calibrated BDT distribution arise from the limited samples used for the trigger efficiency and the event occupancy corrections; they are summed in quadrature with each other and with the statistical uncertainties to determine the total uncertainty on the BDT distribution. The systematic uncertainty on the trigger efficiency correction has been evaluated using B þ → J=ψK þ as a control channel and is related to possible discrepancies between the signal decay and the control channel and to mismodeling in the simulation used in the TISTOS method described in Sec. VI C. The systematic uncertainty due to the event occupancy is obtained by comparing the correction obtained with the default and an alternative interval scheme used to determine the occupancy weights. The systematic uncertainties from the PID selection correction and kinematic reweighting are found to be negligible.
To validate the BDT calibration procedure, an alternative calibration is performed using a data sample containing B 0 ðsÞ → h þ h 0− decays. While this procedure directly measures the BDT distribution on data and the BDT response is expected to be very similar for any two-body B decay, B 0 ðsÞ → h þ h 0− decays require significant corrections to be compared to B 0 ðsÞ → μ þ μ − decays, making it less precise than the default strategy.
The two most frequent B-meson decays into two hadrons, candidates as for the signal, except for the trigger and the particle identification requirements. A trigger selection independent of the B 0 ðsÞ decay products is applied to the candidates to select B 0 ðsÞ → h þ h 0− decays in order to avoid selection biases. Then, a PID selection is applied to separate Normalised yield LHCb simulation To avoid correlated uncertainties, the BDT distribution from B 0 ðsÞ → h þ h 0− data corrected for PID and trigger efficiencies is compared with the distribution of B 0 ðsÞ → μ þ μ − decays from corrected simulation samples. The B 0 ðsÞ → h þ h 0− PID efficiencies are determined with a dedicated procedure [63], while the trigger efficiency is evaluated on B þ → J=ψK þ decays with the trigger calibration method reported in Sec. VI C for the hardware trigger and first software trigger requirements, and determined with B 0 ðsÞ → h þ h 0− simulated samples for the second software trigger selection.
Therefore, the fraction of B 0 ðsÞ → h þ h 0− decays per BDT region, f 0 BDT;i , can be described as where f hh 0 PID;i and f hh 0 trig;i are the relative PID and trigger efficiencies for B 0 ðsÞ → h þ h 0− decays versus B 0 ðsÞ → μ þ μ − decays. The corrected B 0 → K þ π − and B 0 s → K þ K − distributions are compared with those determined on corrected simulated B 0 → μ þ μ − and B 0 s → μ þ μ − samples, respectively, as shown in Fig. 7. The BDT distributions of B 0 s → μ þ μ − and B 0 s → K þ K − decays are compared for the same effective lifetime, namely for A μμ ΔΓ s ¼ 1. Because of the good agreement between the two different methods used to calibrate the BDT response, no additional systematic uncertainty is assigned to the BDT distribution.

VI. NORMALIZATION
The branching fractions of the signal channels are estimated by comparing their yields in data with those of two normalization channels with well-known branching fractions, B þ → J=ψK þ and B 0 → K þ π − , according to where B, ε and N are the branching fraction, efficiency and yield of the corresponding channel and f sigðnormÞ indicates the fragmentation fraction of the relevant B meson. Signal candidates having BDT < 0.25 are not included in the fit to the dimuon mass distribution. The parameter α sig is the single-event sensitivity. In the following, the different elements entering Eq. (9) are described. The final singleevent sensitivity is obtained for each signal channel as the weighted average of those obtained with the two normalization channels, taking the correlations between the inputs into account. The branching fractions of the two normalization channels are taken as BðB þ → J=ψK þ Þ ¼ ð6.02 AE 0.17Þ × 10 −5 [64][65][66], including the J=ψ → μ þ μ − branching fraction, and BðB 0 → K þ π − Þ ¼ ð1.96 AE 0.05Þ × 10 −5 [51,[67][68][69][70][71][72][73][74][75]. The efficiency for the B 0 s → μ þ μ − γ channel is calculated only for those signal decays in the region m μμ > 4.9 GeV=c 2 where the branching fraction is measured.

A. Normalization channel yields
The yields of the normalization channels are obtained through unbinned extended maximum-likelihood fits to the mass distributions of the candidates for each datataking year separately, after the corresponding selection described in Sec. IV. To improve the mass resolution of B þ → J=ψK þ candidates, the J=ψ mass is constrained to its known value [51,[76][77][78][79]. The mass distributions for selected B þ → J=ψK þ candidates are shown in Fig. 8 for the different data taking periods. The mass distribution of signal B þ → J=ψK þ decays is described by a Hypatia function [80] with parameters Gaussian-constrained to the values derived from simulation within their uncertainties, except for the mean and width, which are free to vary in the fit. In addition to signal, the selected candidates contain a contribution of combinatorial background and a small contamination from the B þ → J=ψπ þ decay, where a pion is misidentified as a kaon. The mass distribution of the combinatorial background is described by an exponential function with the slope left free to vary in the fit. The B þ → J=ψπ þ decay, which is expected to peak at higher mass values than the B þ mass due to assigning the kaon mass to a pion track, is described by an analytical function developed in Ref. [81]. This function is obtained by transforming a Gaussian mass distribution under the pion hypothesis to one under the incorrectly-assigned kaon mass hypothesis, using an analytical description of the candidate kinematics. As an alternative model a non-parametric function tuned on a simulated B þ → J=ψπ þ sample, where the events were reconstructed under the B þ → J=ψK þ hypothesis as in Ref. [82], has been used to cross check the results. The fits with the two different descriptions for B þ → J=ψπ þ decays give compatible results for the B þ → J=ψK þ yields.
Additional possible background from B þ c → J=ψK þ K − π þ and Λ 0 b → J=ψpK − decays has been investigated and found to be small and evenly distributed in the considered mass range, and hence is considered negligible.
The yield of B 0 → K þ π − candidates is determined with a binned maximum-likelihood fit to the data, using the fit model for B 0 → K þ π − candidates described in Sec. VA. In contrast to the mass calibration, events are required to be triggered independently of the B 0 → K þ π − signal, such

fb Data 2018 Total
FIG. 9. Mass distribution of B 0 → K þ π − candidates in data for different data-taking years, triggered independently of the signal. Superimposed is a fit to the distribution: the blue line shows the total fit, the red dashed line is the B 0 → K þ π − component, the magenta dashed line is theB 0 s → K þ π − component, the green dashed line is the combinatorial background, and the brown dashed line is the partially reconstructed background component. that the trigger efficiency for this hadronic channel can be determined on B þ → J=ψK þ decays with the same TISTOS method that is used to determine the signal efficiency. The mass distribution of B 0 → K þ π − candidates is shown in Fig. 9 together with the result of the fit, for the different data-taking years. The yields of the two normalization channels are reported in Table II, for the different data-taking years and for the two data-taking periods combined.

B. Fragmentation fractions
The fragmentation fractions, denoted as f u , f d , f s , and f baryon , are the probabilities for a b quark to hadronize into a B þ , B 0 , B 0 s meson or a b baryon, respectively. These fractions include contributions from intermediate states decaying to the aforementioned hadrons via the strong or electromagnetic interactions. The ratio of fragmentation fractions, f s =f d , used in this analysis has been measured at LHCb using several B decay modes: semileptonic B → DμX decays at 7 TeV [83] and at 13 TeV [84]; hadronic B → Dh decays, where h ¼ π, K, at 7, 8 and 13 TeV [85,86]; B → J=ψX decays at 7, 8 and 13 TeV [82]. These measurements have been combined in Ref. [61]. The value of f s =f d is found to be dependent on B transverse momentum and pp collision centre-of-mass energy, while it is found not to be dependent on pseudorapidity. Here only the integrated values at different energies are used since the average p T of b-hadrons in this analysis is found to be compatible with those used in the determination of the fragmentation fractions [61].
Since Following the approach adopted in Ref. [61], isospin symmetry is assumed to hold in b quark hadronization at the LHC such that f u ¼ f d , and hence the same f s =f d values are used in relation to the B þ → J=ψK þ normalization channel.

C. Efficiencies
The efficiencies to detect the signal and normalization channels can be factorized as into reconstruction within the LHCb detector and selection (RecSel), PID, trigger (Trig) efficiencies and exclusion of the first BDT region (BDT > 0.25) on signal candidates. These are evaluated separately on top of each preceding stage. The acceptance, reconstruction and selection efficiencies are evaluated using simulation with corrections applied to improve the agreement with data. The efficiency to detect and reconstruct tracks is evaluated on a sample of J=ψ → μ þ μ − decays in data [87], using a tag-and-probe method. These samples are used to determine efficiency correction factors as a function of the particle kinematics, which are convolved with the simulated samples to calculate the total efficiency correction. The corrections are at the level of 1% for all channels and data-taking years. When considering the ratio of signal and normalization channels in the normalization formulae [Eq. (9)], uncertainties on these corrections are treated as 100% correlated. The total efficiencies for the reconstruction within the LHCb detector and selection are listed for the relevant channels in Table III, where the efficiency with which the muon system detects muons is included. Correction factors for the imperfect modeling of the muon system efficiency simulation are estimated using a sample of B þ → J=ψK þ decays in data, which are selected without particle identification criteria [63,88]. These corrections are applied to the signal, B þ → J=ψK þ and B 0 s → J=ψϕ channels, modifying the efficiencies by 1%-3%.
The efficiency of the PID requirements described in Sec. IV is measured using high-purity control samples of each particle species obtained from data [63,88]. These control samples are obtained by means of kinematic requirements only, with muons obtained from J=ψ → μ þ μ − and B þ → J=ψK þ decays, pions and kaons from D 0 → K − π þ decays selected via D Ãþ → D 0 π þ , and protons from Λ → pπ − and Λ þ c → pK − π þ decays. The muon PID efficiencies are evaluated as a function of the muon momentum and pseudorapidity, as well as the track multiplicity of the event using a dedicated procedure [63]. The resulting efficiency maps are then applied to simulated samples to determine the integrated efficiency for a specific channel. The efficiency measurements for the different hadronic species are described in Sec. VII A. The results for the signal and normalization channels are shown in Table IV and include for the channels with muons the data-simulation correction of the muon system identification. For B þ → J=ψK þ candidates, only the correction to the muon system identification efficiency is computed, as no further requirements on the multivariate PID classifier are applied when selecting these decays. The systematic uncertainties arise from modeling the dependencies of the PID efficiency maps.
The trigger efficiencies are determined from data with the TISTOS method [62]. Trigger information is associated to the reconstructed candidates during the offline processing. The event of a reconstructed signal candidate can be classified into three categories: events triggered on signal (TOS), triggered on part of the underlying event that is independent of the tracks forming the signal candidate (TIS), or triggered on both elements of the signal candidate and the underlying event.
The trigger efficiency can be estimated by exploiting the overlap between the TIS and TOS categories (TIS&TOS) and assuming signal decays uncorrelated with the rest of the event. The trigger efficiency, ε trig , of a given decay channel, with respect to a total of N Tot events, can be computed as where N X is the number of background-subtracted candidates triggered within the category X and the efficiency ε TIS ¼ N TIS&TOS =N TOS is estimated under the already mentioned independence assumption, which is verified in Ref. [62]. The trigger efficiency estimation is done in two steps: the hardware and first software-level trigger efficiencies are estimated from data as described in the following; the second software-level efficiency, being aligned to the offline selection, is estimated from simulation and included in the full trigger efficiency presented in this paragraph. The trigger efficiencies for signal and normalization channels with muons are calibrated using the B þ → J=ψK þ channel. In order to reduce residual kinematic correlations between the decay in question and the rest of the event, the calibration is performed in intervals of kinematic quantities. Yields of B þ → J=ψK þ decays for each trigger category and different kinematic ranges are obtained by performing a mass fit as described in Sec. VI A. Efficiency tables are obtained as a function of the maximum p T of the two muons and of the product of the p T of the two muons, as these are the variables used in the muon hardware trigger. These efficiency distributions are then convolved with the simulated samples of the relevant channels. The trigger efficiencies for the B 0 → K þ π − channel are also obtained in data by measuring the TIS trigger efficiency in Eq. (11) through the more abundant B þ → J=ψK þ channel, since the TIS efficiencies do not depend on the control channel used to evaluate it.
The trigger efficiencies for the signal and normalization channels in each data-taking period are presented in Table V. The systematic uncertainty on the trigger efficiency is comprised of a number of sources. A systematic effect is associated with the choice of the mass model used for the B þ → J=ψK þ channel. This is estimated by fitting the B þ → J=ψK þ data using a double Crystal Ball function TABLE IV. Particle identification efficiencies for the signal and normalization channels, averaged for the two running periods, where the first uncertainty is statistical and the second systematic. A data-simulation correction of the muon-system identification is included for channels with muons. For the B þ → J=ψK þ channel only the datasimulation correction part of the muon identification is reported, as no multivariate PID requirement is applied to this channel.
to model the signal and taking the difference with the default fit as systematic uncertainty. A second systematic uncertainty stems from the precision of the TISTOS method and is obtained by comparing the efficiency determined by applying the TISTOS method to simulation. A third source of systematic effect is due to the difference in the phase space between the B þ → J=ψK þ decay and the decay channel for which the trigger efficiency is evaluated. The corresponding systematic uncertainty is estimated by comparing the results of the method applied to simulated events using the B þ → J=ψK þ channel or the considered channel. The last source of systematic effect arises from the choice of the kinematic ranges for which the efficiencies are evaluated, and its uncertainty is determined from the change of the efficiency when these ranges are varied. The resulting shifts in trigger efficiency from each source are added in quadrature, and assigned as the total systematic uncertainty in Table V.
The efficiencies of the exclusion of the first BDT region on the signal decays are evaluated using the calibrated BDT response described in Sec. V B and are listed in Table VI, combining statistical and systematic uncertainties.

D. Single-event sensitivities
The single-event sensitivities [defined in Eq. (9)] for the three signal channels in Run 1, Run 2 and the full data sample, are reported in Table VII. Single-event sensitivities TABLE V. Trigger efficiencies for the signal and normalization channels, averaged for the two data taking periods. The first uncertainty is statistical and the second systematic.
TABLE VII. Single-event sensitivities, αðB þ Þ, αðB 0 Þ and αðCombÞ for the three signal channels obtained for BDT > 0.25 with the two normalization channels, B 0 → K þ π − and B þ → J=ψK þ , and combined, for Run 1, Run 2 and the full dataset. The first uncertainty is statistical and the second systematic. The expected yields assuming SM branching fractions, N exp , are also reported. For B 0 s → μ þ μ − γ, an approximate branching fraction of 1 × 10 −10 for m μμ > 4.9 GeV=c 2 has been assumed. for BDT > 0.25 are obtained using the two normalization channels B 0 → K þ π − and B þ → J=ψK þ separately, which are combined into a normalization using a weighted average, taking into account the relevant correlations. In the same table, the expected number of signal candidates for BDT > 0.25 is reported, assuming the SM branching fraction. In order to cross-check the normalization of the signal channels, the ratio of the efficiency-corrected yields of the two normalization channels This ratio is found to be 0.340 AE 0.016ðstatÞ and 0.336 AE 0.012ðstatÞ in Run 1 and Run 2, respectively, in agreement with the ratio of the world averages of these branching fractions, 0.326 AE 0.012 [51].
To cross-check the ratio of the B 0 s and B þ fragmentation fractions and its stability over the data taking, the ratio of B þ → J=ψK þ and B 0 s → J=ψϕ efficiency-corrected yields, is also measured, following a similar approach to Ref. [82]. The ratios are found to be similar to those measured in Ref. [82], although the two methods explore different kinematic regions. A dependence on the center-of-mass energy is seen and found to be consistent with Ref. [82] and the combined analysis of Ref. [61], justifying the use of different f s =f d values for the Run 1 and Run 2 data samples.

VII. BACKGROUND
Three classes of background events are considered in the analysis: combinatorial background; B 0 ðsÞ → h þ h 0− decays; semileptonic b-hadron decays. The combinatorial background, mainly composed of real muons originating from two different B decays, is modeled using an exponential function with the slope left free to float in the mass fit, as described in Sec. VIII. The other background sources are included as separate components in the fit, with mass shapes evaluated on simulated events and with yields that are Gaussian-constrained to their estimated values, as explained in the following sections.

A. Hadron misidentification rates
To estimate the yield of physical background where one or two final-state particles are misidentified as a muon, it is crucial to perform unbiased measurements of the probability for protons, pions and kaons to pass the muon identification requirements. These measurements are carried out as a function of the track momentum and transverse momentum, using the data control samples listed in Sec. VI C. The dedicated procedure from Ref. [63] is used for protons, while a different method is developed to determine the pion and kaon misidentification rates, using D 0 → K − π þ from D Ãþ → D 0 π þ decays. For these particles, especially at low momenta, a sizeable contribution to the misidentification rate originates from hadrons decaying to muons. When the hadron decays in flight, the momentum resolution of the reconstructed track degrades by an amount that depends on the distance the hadron has travelled before decaying and on the fraction of energy inherited by the daughter muon. As a consequence, the mass distribution of the D 0 candidates broadens significantly and the efficiency to select D 0 → K − π þ decays in the D 0 mass selection window, 1825 ≤ m Kπ ≤ 1910 MeV=c 2 , decreases. If this effect is not taken into account, a significant underestimation of the misidentified hadron yield would occur. The misidentification efficiencies for pions and kaons are determined by measuring the D 0 yield from a two-dimensional fit to the m Kππ − m Kπ , m Kπ distribution with or without the muon requirement applied to the particle in question. The shape of the signal D 0 mass distribution when the PID selection is applied includes the tail arising from hadron decays-in-flight, estimated from simulated events. The resulting misidentification probability is then corrected for the fraction of D 0 → K − π þ decays with the K − π þ mass falling outside the D 0 selection window, estimated from simulated events.
ðsÞ → h þ h 0− decays can appear as background when both final state hadrons are misidentified as muons. These candidates have a broad mass distribution centered close to the B 0 mass, as determined from simulation where each of the four modes is weighted according to its expected yield.
The expected yield of doubly misidentified B 0 ðsÞ → h þ h 0− events is estimated by normalizing to the B 0 → K þ π − channel as where ε trig B 0 s →μ þ μ − is the signal trigger efficiency, N TIS hh is the number of B 0 ðsÞ → h þ h 0− TIS events evaluated by correcting the B 0 → K þ π − TIS yield by the expected fraction of this mode, ε TIS is the TIS efficiency (Sec. V), and ε hh→μμ represents the double misidentification rate, which is estimated using data control samples (Sec. VII A) and found to be in the range 10 −6 − 10 −5 , depending on the dataset and BDT region. An independent estimate of the B 0 ðsÞ → h þ h 0− background is performed on πμ and Kμ combinations, selected from data samples of B candidates with two tracks in the final states applying strong muon and hadron identification requirements on the tracks. Their mass spectra are fitted and the resulting yields are scaled by the π → μ and K → μ misidentification rates. The ratio between this result and the default estimate is assigned as a correction factor to the misidentification efficiency. The correction factor is consistent with 1 with an uncertainty of 0.03 for BDT > 0.4. The estimated B 0 ðsÞ → h þ h 0− background yields in each BDT region with BDT > 0.25 are summarized in Table VIII for Run 1 and Table IX for Run 2 data. The background yield per fb −1 is smaller in Run 2 compared to Run 1 due to the improvement of the muon identification algorithm, which has a better charged-kaon rejection power with a similar efficiency to select muons.

C. Semileptonic decays
Several semileptonic b-hadron decays, with branching fractions ranging from 10 −8 to 10 −4 , are considered in the fit: B þ c → J=ψμ þ ν μ , with J=ψ → μ þ μ − , and B 0ðþÞ → π 0ðþÞ μ þ μ − decays have two real muons in the final state, decays represent non-negligible background when the final-state hadron is misidentified as a muon. When reconstructed as dimuon candidates, these decays are partially reconstructed and therefore populate the lower B 0 =B 0 s sideband, but can have tails reaching into the signal region.
For each of the above channels, the number of expected candidates is estimated by normalizing to the yield of B þ → J=ψK þ decays, according to where f x is the hadronization fraction of the initial-state hadron for the decay mode x, B x is the branching fraction and ε tot x its total selection and trigger efficiency. The efficiencies are estimated from simulation, except for the PID, which is estimated from data control samples as described in Secs. VI C and VII A. The branching fraction, including the hadronization fraction of the B þ c → J=ψμ þ ν μ channel, is taken from Ref. [89], while those of B 0ðþÞ → π 0ðþÞ μ þ μ − and B 0 → π − μ þ ν μ channels are obtained from Refs. [51,90], assuming f u ¼ f d . LHCb measurements for the Λ 0 b → pμ −ν μ and B 0 s → K − μ þ ν μ branching fractions [91,92] and hadronization fractions [61,84] are used. The estimated yields in each BDT region with BDT > 0.25 are shown in Table VIII for Run 1 and Table IX for Run 2 data.

VIII. MEASUREMENT OF SIGNAL BRANCHING FRACTIONS
The data sample containing B 0 ðsÞ → μ þ μ − ðγÞ candidates is divided into the two data-taking periods, Run 1 and Run 2, which are further divided into six subsets based on the BDT response, using the intervals defined in Sec. IV. The branching fractions of the signal decays are determined using an unbinned extended maximum-likelihood fit to the dimuon mass distributions, performed simultaneously on the subsets with BDT > 0.25. Due to substantial contamination from combinatorial background, the lowest BDT region, 0 ≤ BDT < 0.25, is excluded from the dataset but TABLE IX. Expected background yields per BDT region and for BDT > 0.25 with their total estimated uncertainties for Run 2 data.    its fraction is taken into account in the total normalization of the BDT shape. The fit is performed in a mass window of 4900 ≤ m μμ ≤ 6000 MeV=c 2 . The dimuon mass distribution is shown in Fig. 10 for the Run 1 and Run 2 samples in all BDT intervals. The low mass region is populated by the partially reconstructed background and B 0 s → μ þ μ − γ decays, while the higher mass region is dominated by combinatorial background.
The probability density functions (PDFs) of the B 0 s → μ þ μ − and B 0 → μ þ μ − decays are described by DSCB functions, defined in Eq. (4), with their parameters Gaussian-constrained to the values measured in Sec. V. The mass distribution of the B 0 s → μ þ μ − γ decay is described using an empirical threshold function where the parameters a, b and s, are determined from simulation, which is based on the theoretical predictions and form factors of Ref. [18]. The parameter b is found to be close to 0.5, while the other parameters vary across the BDT regions. This threshold function is convolved with a Gaussian resolution function which models the effect of the detector resolution. The parameter values are estimated from kinematically weighted simulated events, and fixed in the fit. While the branching fraction prediction for the B 0 s → μ þ μ − γ decay is dependent on the exact form-factor parametrization used [19], the distribution of the dimuon mass at high q 2 is found to not depend significantly on the choice of the form-factor, and so the same threshold function can be used for a range of scenarios. Moreover, varying the detector resolution parameter within the known uncertainties has a negligible effect on the yield of B 0 s → μ þ μ − γ decays. The combinatorial background is modeled with a single exponential function with an independent yield in each BDT region but with common slope parameters for each data-taking period. Both the yields and the parameters are free to float in the fit. The common slope assumption across the BDT intervals is introduced to stabilize the fit and it is estimated to be correct at the level of 20%. The systematic uncertainty on the signal branching fractions associated to this assumption is negligible. The mass distribution is found to be well described by an exponential function in all BDT regions both in data with m μμ > 5450 MeV=c 2 , where the combinatorial background dominates, and in simulated bb → μ þ μ − X events. The B 0 ðsÞ → h þ h 0− and semileptonic b-hadron contributions are described using kernel estimation techniques [93] applied to simulated events in each BDT region. Their expected yields in each BDT region are Gaussian-constrained according to the values reported in Sec. VII. Moreover, common parameters, such as the yields of the normalization channels, efficiencies and branching fractions are shared across all BDT regions and their values are Gaussian-constrained to their estimated values and uncertainties.
The result of the fit in each subset is shown in Fig. 10. The resulting branching fractions of the B 0 with m μμ > 4.9 GeV=c 2 : The statistical uncertainties are evaluated by repeating the fit with all nuisance parameters fixed to the value obtained in the standard fit, where all nuisance parameters are free to float within their constraints. The systematic uncertainties are then computed by subtracting in quadrature the statistical uncertainties from the total ones. The main contribution to the systematic uncertainty of BðB 0 s → μ þ μ − Þ originates from the knowledge of f s =f d (3%), while that of BðB 0 → μ þ μ − Þ is dominated by the knowledge of the B 0 ðsÞ → h þ h 0− and semileptonic b-hadron background (9%). The correlation between the B 0 → μ þ μ − and B 0 s → μ þ μ − branching fractions is found to be −11%. The correlation between the B 0 s → μ þ μ − γ and B 0 → μ þ μ − branching fractions is −25%, while that with B 0 s → μ þ μ − is below 10%. Using the α factors in Table VII, Fig. 11.
The difference between the logarithm of the likelihood values under the presence or the absence of a specific signal component is used to evaluate the significance with Wilks' theorem [94]. The B 0 s → μ þ μ − signal exceeds the background-only hypothesis more than 10 standard deviations (σ), while the statistical significance of the B 0 → μ þ μ − signal is 1.7σ and the B 0 s → μ þ μ − γ signal is compatible with the background-only hypothesis within 1.5σ. Upper limits on the branching fractions of B 0 → μ þ μ − and B 0 s → μ þ μ − γ are evaluated using the CL s method [95] with a onesided test statistic [96] as implemented in Refs. [97,98]. The one-sided test statistic for a given branching fraction value is defined as twice the negative logarithm of the profile likelihood ratio if it is larger than the measured branching fraction and zero otherwise. Its distribution is determined from pseudoexperiments, where nuisance parameters are set to their best fit values for toy generation while central values of the Gaussian-constraints are independently fluctuated within their uncertainty for each pseudoexperiment as described in Ref. [99]. The CL s curves are shown in Fig. 12 from which the limit on the B 0 → μ þ μ − and B 0 s → μ þ μ − γ branching fractions are found to be BðB 0 → μ þ μ − Þ < 2.3ð2.6Þ × 10 −10 ; BðB 0 s → μ þ μ − γÞ < 1.5ð2.0Þ × 10 −9 with m μμ > 4.9 GeV=c 2 ; at 90% (95%) CL. The measured upper limits are shown in Fig. 12, together with the expected ones.
To quantify the impact of the B 0 s → μ þ μ − γ component on the other signal modes, the fit is repeated by fixing its branching fraction to zero. Using this configuration, BðB 0 s → μ þ μ − Þ increases by 2% while the limit on BðB 0 → μ þ μ − Þ decreases by 12%.
As described in Sec. V B, the BDT calibration of B 0 s → μ þ μ − decays depends on the effective lifetime which introduces a model dependence in the measured timeintegrated branching fraction. In the fit, the SM value of The model dependence is evaluated by repeating the fit under the assumptions A μμ ΔΓ s ¼ 0 and −1, finding an increase of the B 0 s → μ þ μ − branching fraction with respect to the SM hypothesis of 4.7% and 10.9%, respectively. On the contrary, the B 0 s → μ þ μ − γ branching fraction decreases with respect to the A μμ ΔΓ s ¼ 1 hypothesis by 2% and 5% with a negligible impact on its limit. The dependence for B 0 s → μ þ μ − and B 0 s → μ þ μ − γ is approximately linear in the physically allowed A μμ ΔΓ s range. To evaluate the ratio of the branching fractions R μ þ μ − defined in Eq. (3), the fit is modified such that R μ þ μ − and BðB 0 s → μ þ μ − Þ are floating observables, which allows for the cancellation of common uncertainties, while BðB 0 s → μ þ μ − γÞ is kept as a floating observable. The ratio is found to be where the first uncertainty is statistical and the second systematic. Using the CL s method described above, the upper limit is evaluated to be at 90%ð95%Þ CL.
The effective lifetime of the B 0 s → μ þ μ − decay is measured using the same data sample as for the branching fraction measurement but with a slightly different selection, described in Sec. IV. The data are divided into two regions of the BDT classifier response and unbinned extended maximum-likelihood fits are performed to the dimuon mass distribution in each region in order to calculate weights using the sPlot method [100]. These weights are then used to extract the B 0 s → μ þ μ − signal decay-time distributions. Finally, the effective lifetime is measured using an unbinned maximum-likelihood fit to the weighted decaytime distributions, performed simultaneously to both BDT regions.
The fits to the dimuon mass used to extract the weights are performed in the range 5320 ≤ m μμ ≤ 6000 MeV=c 2 . The lower limit of 5320 MeV=c 2 removes the low mass region containing most of the physical background, including B 0 → μ þ μ − and B 0 ðsÞ → h þ h 0− decays, so only B 0 s → μ þ μ − decay and combinatorial background components are included in the fit. Residual contamination from physical background in the fit region is low and is treated as a source of systematic uncertainty. The B 0 s → μ þ μ − signal is modeled using a DSCB PDF and the background with an exponential PDF. The parameters of the signal PDF are determined using the same method as for the branching fraction (Sec. VIII) and are fixed in the fit, while the signal and background yields and the decay constant of the combinatorial background exponential are allowed to float freely. The distributions of the dimuon mass in the two BDT regions are shown in Fig. 13.
To make an unbiased measurement of the effective lifetime, the decay-time dependence of the combined trigger, reconstruction and selection efficiency must be accurately estimated. This decay-time acceptance is calculated using simulated B 0 s → μ þ μ − candidates, which have been weighted in order to improve agreement with data. Since the LHCb PID response is known to be imperfectly modeled in simulation, the efficiencies of the PID requirements are calculated using calibration samples of muons in data and applied as weights to simulated B 0 s → μ þ μ − candidates. Weights are also calculated to correct the number of tracks in the pp collision, the kinematics of the B meson and inputs to the BDT classifier using candidates from the B 0 → K þ π − control channel in data.
The parameters of the functions used to model the decaytime efficiency are extracted using unbinned maximumlikelihood fits to the decay-time distributions of simulated B 0 s → μ þ μ − candidates, with the effective lifetime fixed to its true value. Since the decay-time efficiency has a different form in the two BDT regions, two different empirical functions are used. In the low BDT region the efficiency is modeled as where Erf is the error function, t is the reconstructed decay time, a, b, c, d and e are free parameters and εðtÞ ¼ 0 when t < 0.26 ps. The acceptance in the high region is modeled using where f, g and t 0 are free parameters and εðtÞ ¼ 0 when t ≤ t 0 . The forms of these functions with respect to the B 0 s meson decay time are shown in Fig. 14. The different behavior in the two intervals reflect the positive correlation of the B 0 s -meson decay time with the BDT response, so that the low (high) BDT region contains more signal decays with small (large) decay times.
Finally, the B 0 s → μ þ μ − effective lifetime is determined using a simultaneous fit to the background-subtracted decay-time distributions in the two BDT regions in data, where the decay-time distributions are modeled by the acceptance functions above multiplied by an exponential function. Only the effective lifetime is allowed to float freely in the fit, while the parameters of the acceptance function are Gaussian constrained to the results of the fits to simulation.
Pseudoexperiments are used to evaluate several systematic effects that have the potential to bias the measurement. The fit procedure is found to return an unbiased estimate of the lifetime to a precision of 0.009 ps and good coverage. The effects of residual contamination from physical background, predominantly B 0 ðsÞ → h þ h 0− and Λ 0 b → pμ −ν μ decays, is found to introduce a bias of around 0.012 ps. The effect of the decay-time acceptance on the mixture of the light and heavy mass eigenstates is evaluated and found to be negligible [32]. A further source of uncertainty is related to the decay-time distribution of the combinatorial background, which is unknown a priori and can bias the  lifetime measurement if the background lifetime is significantly longer than that of the signal. The decay-time distribution of combinatorial background in the dimuon sample cannot be determined directly from data due to the very small number of candidates and so instead the decaytime model used in the pseudoexperiments is taken from a fit to the decay-time distributions of candidates in the high mass region of the higher yield dihadron sample, which includes both a short and long lived component. A systematic uncertainty is evaluated by fluctuating the mean lifetimes of both components upwards by 1σ, which results in an overall bias on the B 0 s → μ þ μ − effective lifetime of around 0.003 ps. The effect of a production asymmetry between B 0 s andB 0 s mesons [101] is found to be small, at 0.002 ps. The effects of ignoring the detector decay-time resolution and also the choice of signal mass PDF are evaluated and are both found to be negligible. Any correlation between mass and decay-time is found to be negligible, as required by the sPlot method.
The entire procedure used to measure the lifetime is cross-checked by performing measurements of the lifetimes of the B 0 → K þ π − and B 0 s → K þ K − decays, which have much larger branching fractions and have already been precisely measured. Candidates are selected using similar requirements to those used to select B 0 s → μ þ μ − decays, with a few differences. While the efficiency of the B 0 s → μ þ μ − trigger selection is independent of decay time, this is not true for B 0 ðsÞ → h þ h 0− decays where flight distance requirements are imposed on candidates. In order to match the B 0 s → μ þ μ − trigger requirements as closely as possible, B 0 ðsÞ → h þ h 0− events are selected with TIS requirements, eliminating any dependence of the trigger efficiency on decay time. Furthermore, different requirements on particle identification information are imposed in order to separate the K þ π − and K þ K − final states. The decay-time acceptance is evaluated using weighted simulation in the same way as for B 0 s → μ þ μ − and the fit procedure is identical apart from some differences in the fit ranges and the inclusion of a B 0 s → K − π þ component in the B 0 → K þ π − fit. The fits to the B 0 s → K þ K − and B 0 → K þ π − mass and weighted decay-time distributions are shown in Figs. 15 and 16. The measured values of the B 0 s → K þ K − and B 0 → K þ π − lifetimes are found to be where the uncertainties are statistical only. Systematic uncertainties on these cross-check measurements are not evaluated. The results are in agreement with the values measured previously by the LHCb collaboration of τ B 0 s →K þ K − ¼ 1.407 AE 0.016 ps and τ B 0 →K þ π − ¼ 1.524 AE 0.011 ps [102]. The measurements presented here are performed on data samples with very little overlap with those used to make the measurements published in Ref. [102] due to different data-taking periods and trigger requirements. The statistical uncertainty on the measured B 0 s → K þ K − lifetime is taken as the systematic uncertainty associated with the use of simulated events to determine the B 0 s → μ þ μ − acceptance function.  A summary of the systematic uncertainties is reported in Table X. Finally, the effective B 0 s → μ þ μ − lifetime is measured with the fits to the decay-time distributions shown in Fig. 17, as τ μ þ μ − ¼ 2.07 AE 0.29 AE 0.03 ps; where the first uncertainty is statistical and the second systematic. While this value lies above the physical range defined by the lifetimes of the light (A ΔΓ ¼ −1) and heavy (A ΔΓ ¼ 1, predicted by the SM) mass eigenstates, which are τ L ¼ 1.423 AE 0.005 ps and τ H ¼ 1.620 AE 0.007 ps [51], it is consistent with these lifetimes at 2.2σ and 1.5σ, respectively.

X. CONCLUSIONS
In summary, the full Run 1 and Run 2 data sample of the LHCb experiment was analysed to measure the B 0 s → μ þ μ − branching fraction and effective lifetime and to search for the B 0 → μ þ μ − and B 0 s → μ þ μ − γ decays. The branching fractions of the B 0 s → μ þ μ − , B 0 → μ þ μ − and B 0 s → μ þ μ − γ decays are measured to be where the first uncertainty is statistical and the second systematic. The systematic uncertainty on the B 0 s → μ þ μ − decay is significantly reduced compared to previous measurements thanks to a new precise value of the hadronization fraction ratio and a more precise calibration of the BDT response and of the particle misidentification rate. The B 0 s → μ þ μ − branching fraction is the most precise singleexperiment measurement to date.
The B 0 → μ þ μ − and B 0 s → μ þ μ − γ signals are not statistically significant, and consistent with the backgroundonly hypothesis at 1.7 and 1.5σ level, respectively. Upper limits on the branching fractions are set to BðB 0 → μ þ μ − Þ < 2.6 × 10 −10 BðB 0 s → μ þ μ − γÞ < 2.0 × 10 −9 at 95% CL, the latter with m μμ > 4.9 GeV=c 2 . The limit on the B 0 → μ þ μ − decay is the most stringent to date from a single experiment. An upper limit on the B 0 s → μ þ μ − γ branching fraction is determined for the first time. This limit only constrains the high-q 2 region of this decay and no attempt is made here to extrapolate the result to the full branching fraction.
Using the same data sample, with a slightly different selection, the effective lifetime of the B 0 s → μ þ μ − decay is found to be τ μ þ μ − ¼ 2.07 AE 0.29 AE 0.03 ps; where the first uncertainty is statistical and the second systematic.
All the results are compatible with the predictions of the SM and with previous measurements of these quantities. In particular, in the two-dimensional space of B 0 s → μ þ μ − and B 0 → μ þ μ − branching fractions the compatibility is at one standard deviation level. These results significantly constrain possible contributions to these decays from new interactions that would cause effective scalar, pseudoscalar or axial-vector currents, and thus limit the parameter space of new physics models.