Measurement of the $B^0_s\to\mu^+\mu^-$ decay properties and search for the $B^0\to\mu^+\mu^-$ and $B^0_s\to\mu^+\mu^-\gamma$ decays

An improved measurement of the decay $B^0_s\to\mu^+\mu^-$ and searches for the decays $B^0\to\mu^+\mu^-$ and $B^0_s\to\mu^+\mu^-\gamma$ are performed at the LHCb experiment using data collected in proton-proton collisions at $\sqrt{s} = 7,~8$ and $13$ TeV, corresponding to integrated luminosities of 1, 2 and 6 fb$^{-1}$, respectively. The $B^0_s\to\mu^+\mu^-$ branching fraction and effective lifetime are measured to be ${\mathcal{B}}(B^0_s\to\mu^+\mu^-)=\left(3.09^{+0.46+0.15}_{-0.43-0.11}\right)\times 10^{-9}$ and $\tau(B^0_s\to\mu^+\mu^-)=(2.07\pm 0.29\pm 0.03)$ ps, respectively, where the uncertainties include both statistical and systematic contributions. No significant signal for $B^0\to\mu^+\mu^-$ and $B^0_s\to\mu^+\mu^-\gamma$ decays is found and the upper limits $\mathcal{B}(B^0\to\mu^+\mu^-)<2.6\times 10^{-10}$ and $\mathcal{B}(B^0_s\to\mu^+\mu^-\gamma)<2.0\times 10^{-9}$ at 95% confidence level are determined, where the latter is limited to the range $m_{\mu\mu}>4.9$ GeV$/c^2$. Additionally, the ratio between the $B^0\to\mu^+\mu^-$ and $B^0_s\to\mu^+\mu^-$ branching fractions is measured to be $\mathcal{R}_{\mu^+\mu^-}<0.095$ at 95% confidence level. The results are in agreement with the Standard Model predictions.


Introduction
Decays mediated by a quark flavour-changing neutral interaction are not allowed at tree level in the Standard Model (SM) of particle physics but can proceed through quantum loops, making them rare processes. The leptonic B 0 → µ + µ − and B 0 s → µ + µ − decays (the inclusion of charge-conjugated processes is implied throughout this paper) are even rarer because they are additionally helicity-suppressed. As they are characterised by a purely leptonic final state, and thanks to the progress in lattice QCD calculations [1][2][3][4][5], their time-integrated branching fractions are predicted in the SM with small uncertainties to be B(B 0 s → µ + µ − ) = (3.66 ± 0.14) × 10 −9 and B(B 0 → µ + µ − ) = (1.03 ± 0.05) × 10 −10 [6,7]. This makes these processes powerful probes for detecting deviations from the SM due to new physics (NP) contributions mediated, for instance, by heavy Z gauge bosons, leptoquarks or non-SM Higgs bosons (see e.g. Ref. [8]).
An effective field theory description of b → sµ + µ − transitions makes it possible to tightly constrain the currents contributing to their amplitudes in a model-independent way. In this framework, the branching fractions of B 0 → µ + µ − and B 0 s → µ + µ − decays are sensitive to axial-vector, scalar and pseudoscalar operators and their chirality-flipped counterparts [9,10]. Of these, only the left-handed axial-vector current is present in the SM at a significant level. Examples of SM Feynman diagrams contributing to the B 0 s → µ + µ − amplitude are shown in Fig. 1(a) and 1(b). Given the low rate of the B 0 → µ + µ − and B 0 s → µ + µ − decays, their branching fractions are measured without distinguishing between B 0 (s) and B 0 (s) at production. Both B 0 and B 0 s mesons oscillate into their antiparticles but, in contrast to the B 0 system, the light and heavy mass eigenstates of B 0 s mesons are characterised by a sizeable difference between their decay widths, ∆Γ s = 0.085 ± 0.004 ps −1 [11], and thus have different lifetimes. This gives rise to the relation [12] B(B 0 s → µ + µ − ) = 1 + A µµ ∆Γs y s 1 − y 2 between the flavour-untagged and time-integrated branching fraction and the value at decay time t = 0. The theoretical prediction mentioned at the beginning of this section includes this correction. In Eq. 1, y s ≡ ∆Γ s /(2Γ s ) = 0.065 ± 0.003 [11] and the parameter A µµ ∆Γs is defined as A µµ ∆Γs ≡ −2 (λ)/(1 + |λ| 2 ), with λ = (q/p)(A(B 0 s → µ + µ − )/A(B 0 s → µ + µ − )). The complex coefficients p and q relate the mass and CP eigenstates of the B 0 s − B 0 s system with the flavour eigenstates B 0 s and B 0 s (see, e.g., Ref. [11]), and A is the amplitude of the process. In the SM, only the CP -odd eigenstate (which, except for small modifications from CP violation, corresponds to the heavy mass eigenstate) decays to µ + µ − and the quantity A µµ ∆Γs is equal to unity. However, in the presence of NP contributions it can assume any value in the range −1 ≤ A µµ ∆Γs ≤ 1 [12]. Thus the B 0 s → µ + µ − branching fraction might differ from the SM prediction in either of the two factors in the right-hand side of Eq. 1.
The B 0 s → µ + µ − effective lifetime is defined as where t is the decay time of the B 0 s or B 0 s meson, the decay-time distribution Γ(B s (t) → µ + µ − ) for B 0 s → µ + µ − and B 0 s → µ + µ − decays with and without os- cillations is defined as Γ(B s (t) → µ + µ − ) ≡ Γ(B 0 s (t) → µ + µ − ) + Γ B 0 s (t) → µ + µ − , and τ B 0 s = 1.515 ± 0.004 ps [11] is the mean B 0 s lifetime. By measuring the B 0 s → µ + µ − effective lifetime, the contribution of each mass eigenstate, and thus the CP structure of the interaction involved in the decay, can be inferred, and a direct evaluation of A µµ ∆Γs can be performed. The lifetime thus makes it possible to discriminate between contributions from scalar or pseudoscalar interactions in a complementary way to the branching ratio. Similar effects are not significant for B 0 → µ + µ − decays due to the negligible decay width difference of the B 0 mass eigenstates.
The ratio of the B 0 → µ + µ − and B 0 s → µ + µ − branching fractions also provides powerful discrimination between NP theories [13]. This quantity is theoretically more precise than the two individual branching fractions due to the cancellation of common theoretical uncertainties. It can be obtained as where τ B 0 is the lifetime of the B 0 , Γ s H is the width of the heavy-mass eigenstate of the B 0 s meson, M B 0 and M B 0 s are the masses of the B 0 (s) mesons, f B 0 and f B 0 s are the B 0 (s) meson decay constants, V td and V ts are Cabibbo-Kobayashi-Maskawa (CKM) matrix elements and m µ is the mass of the muon. In the SM, R µ + µ − is predicted to be 0.0281 ± 0.0016 [7] and it assumes the same value in all NP models with the same flavour structure as the SM [14].
The B 0 s → µ + µ − γ decay is also rare in the SM. Compared to the B 0 s → µ + µ − amplitude, the additional suppression arising from the photon is compensated by the fact that the amplitude is no longer helicity suppressed, increasing the total predicted branching fraction to O(10 −8 ) [15][16][17][18][19][20][21]. Two groups of amplitudes contribute to this decay: those where the photon is emitted from the initial state (initial-state radiation or ISR), an example of which is shown in Fig. 1(c), and those in which it is emitted from the final state (final-state radiation, FSR), as in Fig. 1(d). Their interference is evaluated to be negligible due to their combined helicity and kinematic suppression [18,19,22]. The FSR contribution to the B 0 s → µ + µ − γ process is experimentally included in the B 0 s → µ + µ − decay through the description of the radiative tail in its mass distribution. The ISR component is sensitive to a wider range of interactions and is treated as a separate contribution to the mass fit. In the mass region of interest its contribution decreases as the mass increases, becoming null for values larger than the B 0 s mass. Similar to other multibody b → s decays, the sensitivity to different interactions depends on the dimuon mass squared, q 2 , of the decay. At low q 2 , the decay is mostly sensitive to tensor and pseudotensor interactions, while at high q 2 vector and axial-vector contributions dominate [23,24]. This makes the ISR B 0 s → µ + µ − γ decay at high q 2 an ideal place to probe the same interactions that drive the anomalies seen in some b → s decays [25][26][27][28]. In the rest of this paper, B 0 s → µ + µ − γ refers only to the ISR process.
Measurements of B 0 → µ + µ − and B 0 s → µ + µ − processes have attracted considerable experimental interest since the first search for these decays at the CLEO experiment [29] almost forty years ago. The first evidence for the B 0 s → µ + µ − decay was obtained at LHCb [30] with data corresponding to 2 fb −1 of proton-proton (pp) collisions, and the decay was then observed through a combined analysis of data taken by the LHCb and CMS experiments [31]. Subsequent measurements were performed by the LHCb collaboration [32] with 4.4 fb −1 , by the ATLAS collaboration [33] with 51.3 fb −1 , and by the CMS collaboration [34] with 63 fb −1 . These last three measurements are combined in Ref. [35], yielding B(B 0 s → µ + µ − ) = 2.69 + 0.37 − 0.35 × 10 −9 and an upper limit on the B 0 → µ + µ − decay of B(B 0 → µ + µ − ) < 1.9 × 10 −10 at 95% confidence level (CL). In the two-dimensional plane of B(B 0 s → µ + µ − ) and B(B 0 → µ + µ − ), the consistency of the profile likelihood minimum with the SM prediction is measured to be 2.1 standard deviations (σ). To date, no experimental search has been performed for the B 0 s → µ + µ − γ decay, while the corresponding B 0 decay has been probed by the BaBar experiment, yielding B(B 0 → µ + µ − γ) < 1.5 × 10 −7 at 90% CL in the whole q 2 region [36], which is well above the SM prediction.
This paper presents improved measurements of the B 0 s → µ + µ − time-integrated branching fraction and effective lifetime, as well as a search for the B 0 → µ + µ − decay, superseding the results in Ref. [32]. Moreover, a first search for the B 0 s → µ + µ − γ decay at high dimuon mass is also presented. These results, also reported in Ref. [37], are based on data collected with the LHCb detector, corresponding to an integrated luminosity of 1 fb −1 of pp collisions at a centre-of-mass energy √ s = 7 TeV, 2 fb −1 at √ s = 8 TeV and 6 fb −1 at √ s = 13 TeV. The first two data sets are referred to as Run 1 and the latter as Run 2. Throughout this paper, B 0

Analysis Strategy
The signature of B 0 (s) → µ + µ − decays in the LHCb detector consists of two oppositely charged muons with a dimuon mass in the B 0 s or B 0 mass region, and a decay vertex displaced with respect to any pp interaction vertex as a result of the significant average flight distance of the B mesons. The B 0 s → µ + µ − γ channel is searched for with the same signature, without reconstructing the photon, as was proposed in Ref. [22].
The main background can be divided into two categories: combinatorial background arising from random combinations of muons from two distinct b-hadron decays in the same event, and physical background comprising b-hadron decays where one or more final state particles has either been misidentified as a muon or not reconstructed. Combinatorial background candidates are distributed across the entire search region from low to high mass, while physical background contributions tend to populate the region below the B 0 s mass. The dominant physical background sources are: B 0 (s) → h + h − decays where the hadrons h, h = K, π are misidentified as muons, which mainly contribute to the B 0 mass region; and partially reconstructed b-hadron decays, which populate the same lower dimuon mass region as the B 0 s → µ + µ − γ signal. The most important partially reconstructed background sources are semileptonic Combinatorial background is separated from the B 0 (s) → µ + µ − signal by exploiting differences between their topologies and the relative isolation from other tracks in the event of the muons forming the B candidate. This information is combined in a multivariate classifier based on a boosted decision tree [38], the output response of which, BDT, is used to classify the events as described in Sec. 4. A stringent particle identification (PID) requirement is used to suppress physical background from B 0 (s) → h + h − and semileptonic decays, described in Sec. 7.
As presented in Sec. 8, the signal yields are estimated using an extended unbinned maximum-likelihood fit to the dimuon mass distribution, which is performed simultaneously in intervals of the BDT response to increase the sensitivity of the measurement. The BDT and mass distributions of the signals are calibrated and validated using data, as detailed in Sec. 5.
The measurement of the B 0 s → µ + µ − effective lifetime uses a similar selection strategy, which is optimised to achieve the highest sensitivity. After the selection, a maximumlikelihood fit is performed to the dimuon mass distribution in two BDT regions to subtract the background. The decay-time acceptance in each BDT region is calibrated on corrected simulation samples and validated by applying the analysis procedure to B 0 (s) → h + h − candidates from data. Finally, the effective lifetime is extracted using a maximum-likelihood fit to the background-subtracted decay-time distribution, performed simultaneously across both BDT regions, as presented in Sec. 9.

Detector and simulation
The LHCb detector [39,40] is a single-arm forward spectrometer covering the pseudorapidity range 2 < η < 5, designed for the study of particles containing b or c quarks. The detector includes a high-precision tracking system consisting of a siliconstrip vertex detector surrounding the pp interaction region, a large-area silicon-strip detector located upstream of a dipole magnet with a bending power of about 4 Tm, and three stations of silicon-strip detectors and straw drift tubes placed downstream of the magnet. The tracking system provides a measurement of the momentum, p, of charged particles with a relative uncertainty that varies from 0.5% at low momentum to 1.0% at 200 GeV/c. The minimum distance of a track to a primary pp collision vertex (PV), the impact parameter (IP), is measured with a resolution of (15 + 29/p T ) µm, where p T is the component of the momentum transverse to the beam, in GeV/c. Different types of charged hadrons are distinguished using information from two ring-imaging Cherenkov detectors. Photons, electrons and hadrons are identified by a calorimeter system consisting of scintillating-pad and preshower detectors, an electromagnetic and a hadronic calorimeter. Muons are identified by a system composed of alternating layers of iron and multiwire proportional chambers. The online event selection is performed by a trigger, which consists of a hardware stage, based on information from the calorimeter and muon systems, followed by two software stages. The first software stage performs a preliminary event reconstruction using only part of the available event information, while the second stage performs a full event reconstruction.
Simulation is used to estimate the acceptance, reconstruction and selection efficiencies and to optimise the analysis strategy. The pp collisions are generated using Pythia [41] with a specific LHCb configuration [42]. Decays of particles are described by EvtGen [43]. Decays of B + c meson are generated using the dedicated Bcvegpy generator [44,45].Finalstate radiation in the decay of particles is simulated using Photos [46], which is observed to agree with a full quantum electrodynamics calculation at the level of 1% [47]. The interaction of the generated particles with the detector, and its response, are implemented using the Geant4 toolkit [48], as described in Ref. [49].

Signal selection
In the online event selection, signal candidates are first required to pass the hardware trigger, which selects events with at least one muon with high transverse momentum, followed by a two-level software stage, which applies a full event reconstruction. The software stage imposes minimum requirements on the muon transverse momentum and impact parameter with respect to all PV. However, to maximise the signal selection efficiency, events triggered by particles not related to the signal candidates are also retained for further analysis.
Candidate B 0 (s) → µ + µ − decays are selected offline by combining two well-reconstructed oppositely charged particles identified as muons [50], with transverse momentum in the range 0.25 < p T < 40 GeV/c, and momentum p < 500 GeV/c. The muon candidates are required to form a secondary vertex (SV) with a vertex-fit χ 2 per degree of freedom smaller than 9 and separated from any PV with a significance greater than 15. Only muon candidate tracks with χ 2 IP > 25 for any PV are selected, where χ 2 IP is defined as the difference between the vertex-fit χ 2 of the PV formed with and without the particle in question.
The resulting B 0 (s) candidates must have a decay time lower than 13.25 ps, χ 2 IP < 25 with respect to the PV for which the χ 2 IP is minimal (henceforth referred to as the PV associated with the B 0 (s) candidate) and p T > 0.5 GeV/c. To suppress the B + c → J/ψµ + ν µ background, a B 0 (s) candidate is rejected if either of the two candidate muons combined with any other oppositely charged muon candidate in the event has a mass within 30 MeV/c 2 of the J/ψ mass [51] (J/ψ veto). Further requirements on the particle identification (PID) information of the two muons are imposed in order to reject misidentified hadronic background. PID identification uses multivariate techniques to combine information from different subsystems taking correlations into account [40].
Candidate B 0 (s) → µ + µ − decays used in the branching fraction measurements are selected in the dimuon mass range 4900 ≤ m(µ + µ − ) ≤ 6000 MeV/c 2 , while those used in the lifetime measurement are selected in a narrower range, 5320 ≤ m(µ + µ − ) ≤ 6000 MeV/c 2 . The reduced mass range used in the lifetime measurement excludes most of the B 0 → µ + µ − and physical background decays that populate the lower dimuon mass region, greatly simplifying the fit and making it possible to impose less stringent PID requirements used to reject misidentified background, thus increasing the signal selection efficiency. To avoid potential biases, the candidates in the mass region 5200 ≤ m(µ + µ − ) ≤ 5445 MeV/c 2 , where B 0 (s) → µ + µ − candidates peak, were not examined until the analysis procedure was finalised.
In addition to the signal channels, B 0 (s) → h + h − and B + → J/ψK + decays are selected as normalisation and control channels, and B 0 s → J/ψφ as control channel. Candidate B 0 (s) → h + h − decays are selected using the same requirements as the signal channels, except that the muon identification criteria are replaced with hadron identification, the events are triggered independently of the decay final state, and the J/ψ veto is not applied. Candidate B + → J/ψK + and B 0 s → J/ψφ decays are formed by combining a muon pair, with mass close to the J/ψ mass [51], with one track (B + → J/ψK + ) or two oppositely charged tracks consistent with originating from a φ decay (B 0 s → J/ψφ), with the kaon mass hypothesis assigned. All tracks forming B + → J/ψK + and B 0 s → J/ψφ candidates are selected with the same requirements as those applied to select the B 0 (s) → µ + µ − candidates, except for the dimuon mass range and the particle identification criteria. The muons are only required to pass the muon system identification criteria [50] and no kaon identification criteria are required, given the already excellent signal purity achieved. The same trigger strategy as for the signal decays is used for these two channels.
Background events are further rejected using a loose requirement on the response of a boosted decision tree [38,52,53], which was first described in Ref. [54] and has remained unchanged. The classifier takes as input: the angle between the direction of the momentum of the B 0 (s) candidate and the direction defined by the vector joining the primary and the secondary vertices; the B 0 (s) candidate IP and its vertex χ 2 ; the minimum IP of the muons with respect to any PV; the minimum distance between the two muon tracks; the χ 2 of the SV. This classifier is also applied to the control and normalisation channels, where for the B + → J/ψK + and B 0 s → J/ψφ modes the χ 2 of the SV is replaced with that of the J/ψ vertex. The selected sample of B 0 (s) → µ + µ − candidates is dominated by random combinations of two muons (combinatorial background), mainly originating from semileptonic decays of two different b hadrons. The reconstruction and selection efficiencies for the signal and normalisation modes are reported in Sec. 6.3.
Isolation variables, which quantify the possibility that other tracks in the event originated from the same hadron decay as the signal muon candidates, are constructed in order to further reject background. Most combinatorial background candidates arise from semileptonic b-hadron decays, where other charged particles produced in the decay may be reconstructed close to the signal muon candidate. Two isolation variables are designed to recognise these particles, each considering a different type of track: one uses additional tracks that have been reconstructed both before and after the magnet (long tracks), while the other considers tracks reconstructed only in the vertex detector (VELO tracks). These isolation variables are defined based on the proximity of the two muons forming the B 0 (s) candidate to other tracks in the event. The closeness of each muon candidate to either a long track or a VELO track is measured using two dedicated multivariate classifiers that take the following quantities as inputs: the minimum χ 2 IP of the track with respect to any PV; the signed distance between the muon-track vertex and the PV associated to the B 0 (s) candidate; the signed distance between the muon-track vertex and the B 0 (s) decay vertex; the distance of closest approach between the track and the muon; the angular separation between the track and the muon; a quantity that measures the compatibility of the muon-track system with having originated from the PV associated to the B 0 (s) candidate. The long-track isolation classifier takes three additional variables as input: the absolute difference between the azimuthal angles of the track and the muon; the absolute difference between the pseudorapidities of the track and the muon; and the track p T . The classifiers are trained on collections of track-muon pairs from simulated B 0 s → µ + µ − decays and from simulated bb → µ + µ − X events. The latter sample includes decays with two muons originating from two different b-hadrons or from the same b-hadron. In both cases the muons can either originate directly from the hadron containing the b-quark or from intermediate resonances. Only tracks originating from the same b-hadron decay as the muon candidate are considered to train the classifier with the simulated bb → µ + µ − X events. The output value of the classifiers for a given track-muon pair is defined to be higher when the track is "closer" to the muon. Defining I(µ ± ) as the maximum value of a given classifier over all the track-muon pairs in the event, the long and VELO-track isolation variables are each defined as I(µ + ) + I(µ − ).
Signal and background events are separated using a final boosted decision tree classifier that combines kinematic, topological and isolation information, defined as in the previous measurement [32]. The BDT response is used to divide the data into samples of varying signal purity, which are then fitted simultaneously as described in Secs. 8 and 9. The classifier is trained using simulated samples of B 0 s → µ + µ − decays as signal and of inclusive bb → µ + µ − X events as proxy for the combinatorial background. It combines information from the following input variables: ∆φ 2 + ∆η 2 , where ∆φ and ∆η are the azimuthal angle and pseudorapidity differences between the two muon candidates; the minimum χ 2 IP of the two muons with respect to the B 0 (s) associated PV; the angle between the B 0 (s) candidate momentum and the vector joining the B 0 (s) decay vertex and B 0 (s) associated PV; the B 0 (s) candidate vertex-fit χ 2 ; the B 0 (s) impact parameter significance with respect to the B 0 (s) associated PV; the long-and VELO-track isolation variables. The BDT classifier response is defined to have an approximately uniform distribution in the range 0 ≤ BDT ≤ 1 for signal, and to peak at zero for background. Its correlation with the dimuon mass is below 5%. The branching fraction measurement is performed by dividing   the Run 1 and Run 2 data samples into six subsets each, based on regions in the BDT response with boundaries 0, 0.25, 0.4, 0.5, 0.6, 0.7 and 1. Figure 2 shows the expected BDT distribution for B 0 s → µ + µ − decays, as determined in Sec. 5.2, and combinatorial background. The sample with 0 ≤ BDT < 0.25 is discarded as it is dominated by background. The data used in the measurement of the B 0 s → µ + µ − effective lifetime are split into two regions in the BDT response with the ranges 0.35 ≤ BDT < 0.55 and 0.55 ≤ BDT ≤ 1, which are chosen to minimise the expected statistical uncertainty on the effective lifetime based on the results of pseudoexperiments.

Signal calibration
The dimuon mass and the BDT classifier response are used to separate signal from background in the determination of the branching fractions and the B 0 s → µ + µ − effective lifetime. It is therefore essential that these variables are accurately calibrated in order to account for possible discrepancies between data and simulation. The calibration procedures for these two variables are described in the following sections.

Mass shape calibration
The mass shape of the B 0 s → µ + µ − and B 0 → µ + µ − signals is described with a double-sided Crystal Ball (DSCB) function [55] f (m|µ, σ, α l , n l , α r , n r ) = N where m is the dimuon mass and all the parameters are positive. The function has a Gaussian core with mean µ and resolution σ and power-law tails on both sides defined by two starting points in units of σ, α l and α r , and two slopes, n l and n r for the left and right side, respectively. The mean of the B 0 s → µ + µ − and B 0 → µ + µ − signal peaks are calibrated with data samples containing B 0 s → K + K − and B 0 → K + π − decays, respectively. Besides  the contamination from combinatorial background, the B 0 → K + π − sample contains contributions from B 0 s → K − π + decays and partially reconstructed background decays, while the B 0 Fig. 3, are modelled with a DSCB function. The difference between the B 0 and B 0 s mass values, taken from Refs. [51,[56][57][58], is used to constrain the B 0 s → K − π + mean with respect to the B 0 → K + π − mode. The mass resolutions for the two modes are constrained using the mass resolution calibration described below. The combinatorial background is modelled with an exponential shape with its slope parameter left to vary freely. The partially reconstructed background component is described by an ARGUS shape [59], while the component for Λ 0 b → ph − decays is modelled as the sum of two Crystal Ball (CB) functions [55], with all parameters, except the total yield, fixed from simulation. The results of the fits are shown in Fig. 3. To check the correlation between the PID selection and the mean of the B 0 (s) → h + h − signal peak, and to study the effect of possible contamination from misidentified background, the fit is repeated after tightening the PID requirements. The variation of the mean value is assigned as a systematic uncertainty. The mean is found to be uncorrelated with the BDT response.

Data 6 fb Total
The mass resolution is calibrated with data samples containing charmonium (J/ψ, ψ(2S)) and bottomonium (Υ (1S), Υ (2S) and Υ (3S)) resonances decaying into two muons, selected similarly to the signal. The natural widths of all these resonances are negligible compared to the mass resolution of the LHCb experiment. The resolution of each resonance is obtained from a mass fit to the data. The distributions of the dimuon mass, m µ + µ − , of quarkonium decays, shown in Fig. 4, are modelled with a DSCB function. The combinatorial background is modelled with an exponential shape. In the bottomonium fits, the tails are constrained from simulation. A second-order Chebychev polynomial is used as alternative shape for the combinatorial background. The difference between the mass resolutions measured with the two background descriptions is taken as systematic uncertainty. The power-law function σ µ + µ − (m µ + µ − ) = a 0 + a 1 · (m µ + µ − ) a 2 is found to describe the mass resolution of simulated Drell-Yan events accurately. This function is fitted to the measured resolutions of the quarkonia, including their systematic uncertainties, and used to determine the mass resolution at the B 0 s and B 0 mass, as shown in Fig 5. The mass resolution in Run 2 is found to be slightly better than in Run 1, which is explained by improvements in the track reconstruction. The Υ (3S) resolution is larger than expected from the power-law function and thus an interpolation with a third-order polynomial was also performed. The differences of the interpolated B 0 and B 0 s mass resolutions with respect to their default widths are assigned as a systematic uncertainty.
The four tail parameters of the B 0 (s) → µ + µ − signal shape are in common between the B 0 and B 0 s decays. They are determined from the mass distributions in simulation, after convoluting them with a Gaussian function to match their core resolutions with the values found in data as described in the previous paragraph.
Each step of the mass calibration is performed separately for each data-taking year. Subsequently, average results for the mass shape parameters for Run 1 and Run 2 are calculated by weighting the year-by-year values by the integrated luminosity in each year. These averages are then used in the final mass fit, with their values shown in Table 1. A small but significant dependence of the mass resolution with the BDT response is found, for which correction factors ranging from 0.97 to 1.03 are applied. For the left-hand tail parameters, BDT-dependent correction factors ranging from 0.9 to 1.2 are obtained. Table 1: Luminosity-weighted signal mass shape parameter combinations per data set, including propagated uncertainties. Where appropriate, statistical and systematic uncertainties are added in quadrature. As the tail parameters determined in Run 1 and Run 2 are consistent, they are combined into common estimates.

Data 6 fb Total Combinatorial
candidates in (left) Run 1 and (right) Run 2 data. The result from the fit to determine the mass resolutions to each sample is overlaid, and the components are detailed in the legend.

BDT calibration
The BDT response is determined using simulated decays, to which corrections are applied to account for possible discrepancies between simulation and data. This calibration procedure is performed in three steps. Firstly, the B-meson kinematics and the detector occupancy of the simulation are corrected using control channels in data. Secondly, the effect of the PID and trigger selections are evaluated on data and used to correct the BDT response. Finally, since the BDT response for real B 0 s → µ + µ − decays is strongly correlated with their decay-time distribution and hence with the B 0 s → µ + µ − effective lifetime, which is a priori unknown, an additional correction is applied under different effective lifetime hypotheses. Taking these three correction factors into account, the fraction of signal decays in each BDT region, f BDT,i , can be expressed as where f µµ sim,i is the fraction of events per BDT region in the corrected simulation, f µµ PID,i and f µµ trig,i are the weights used to correct for the PID and trigger selections, respectively, and the effective lifetime correction k i is used for The B meson kinematics of the simulated signals are corrected using a gradient boosting reweighter. This technique consists of training a boosted decision tree classifier to align two samples, in this case data and simulation, as described in Ref [60]. The transverse momentum p T , the pseudorapidity η, and the χ 2 IP of the B candidate are used as input variables for the gradient boosting reweighter, as these are the variables required to correct the simulation. The weights obtained from this procedure are applied to all simulation samples used for calibration and normalisation. The kinematic distributions for B 0 and B 0 s mesons differ, as determined in hadronisation fraction measurements [61], thus they are corrected with a sample of B + → J/ψK + and B 0 s → J/ψφ decays, respectively. An additional correction to the BDT distribution shape stems from the event occupancy, measured as the number of tracks in the event. As this affects the muon track isolation variables, which are important inputs to the BDT classifier, the correction is determined in four intervals of the total number of reconstructed tracks. The correction weights are determined by comparing the relative number of B + → J/ψK + decays in backgroundsubtracted data and simulated samples in these intervals. It is ensured that the input variable distributions of B + → J/ψK + candidates match those of B 0 (s) → µ + µ − candidates as closely as possible by evaluating the BDT response based on the final state muons and the B + candidate, with two exceptions: the final state kaon is excluded from the calculation of the isolation variables, and the decay vertex χ 2 is determined on the J/ψ candidate, to match the number of degrees of freedom of the signal.
The PID efficiency correction per BDT region for B 0 (s) → µ + µ − decays, f µµ PID,i , is determined on dedicated calibration samples and convolved with the muon kinematics of simulated signal per region, as described in Sec. 6.3. As the total PID efficiency is part of the normalisation (Sec. 6), the PID efficiency correction of the BDT response is determined as the relative PID efficiency per BDT region; no uncertainty is assigned on this correction, as it is already included in the total efficiency.
A similar procedure is adopted for the trigger efficiency per BDT regions, f µµ trig,i . The trigger efficiencies are determined on data samples containing B + → J/ψK + decays with the trigger calibration method (see Sec. 6.3) [62]. They are calculated in ranges of the maximum p T of the two muons and the product of the p T of the two muons, which are the variables used for the muon and dimuon hardware trigger. The trigger efficiency per BDT region is determined by the convolution of the obtained efficiencies with the kinematics of simulated signal. The details of this method, also employed for the full efficiency determination, are given in Sec. 6.3.
In simulation, B 0 s → µ + µ − (γ) decays are generated using the mean B 0 s lifetime, while the effective lifetime can have any value between the lifetime of the light and the heavy mass eigenstates. As the BDT classifier is correlated with the B 0 s candidate decay time, an additional correction is included for the B 0 s → µ + µ − (γ) BDT response distributions. The correction is evaluated for A µµ ∆Γs = −1, 0 and 1, corresponding to τ µ + µ − = 1.423, 1.527 and 1.620 ps, and covering the full physically allowed range. Simulated candidates selected with the procedure described in Sec. 4 are weighted according to where t j is the reconstructed decay time of the candidate j, τ gen is the lifetime used for generation, and τ µ + µ − is the effective lifetime calculated from y s , τ B 0 s and A µµ ∆Γs . A correction factor, k i , is then calculated for each BDT region according to where N i is the number of signal decays in each BDT region. The calibrated BDT response for B 0 s → µ + µ − decays under the SM hypothesis A µµ ∆Γs = 1 and for B 0 → µ + µ − decays are shown in Fig. 6. The main systematic uncertainties on the calibrated BDT distribution arise from the limited samples used for the trigger efficiency and the event occupancy corrections; they are summed in quadrature with each other and with the statistical uncertainties to determine the total uncertainty on the BDT distribution. The systematic uncertainty on the trigger efficiency correction has been evaluated using B + → J/ψK + as a control channel and is related to possible discrepancies between the signal decay and the control channel and to mismodelling in the simulation used in the TISTOS method described in Sec. 6.3. The systematic uncertainty due to the event occupancy is obtained by comparing the correction obtained with the default and an alternative interval scheme used to determine the occupancy weights. The systematic uncertainties from the PID selection correction and kinematic reweighting are found to be negligible.
To validate the BDT calibration procedure, an alternative calibration is performed using a data sample containing B 0 (s) → h + h − decays. While this procedure directly measures the BDT distribution on data and the BDT response is expected to be very similar for any two-body B decay, B 0 (s) → h + h − decays require significant corrections to be compared to B 0 (s) → µ + µ − decays, making it less precise than the default strategy. The two most frequent B-meson decays into two hadrons, candidates as for the signal, except for the trigger and the particle identification requirements. A trigger selection independent of the B 0 (s) decay products is applied to the candidates to select B 0 (s) → h + h − decays in order to avoid selection biases. Then, a PID selection is applied is determined by fitting the mass distribution of the two hadrons with the corresponding mass hypothesis. A detailed description of the mass fit for B 0 (s) → h + h − decays is given in Sec. 5.1. To avoid correlated uncertainties, the BDT distribution from B 0 (s) → h + h − data corrected for PID and trigger efficiencies is compared with the distribution of B 0 (s) → µ + µ − decays from corrected simulation samples. The B 0 (s) → h + h − PID efficiencies are determined with a dedicated procedure [63], while the trigger efficiency is evaluated on B + → J/ψK + decays with the trigger calibration method reported in Sec. 6.3 for the hardware trigger and first software trigger requirements, and determined with B 0 (s) → h + h − simulated samples for the second software trigger selection.
Therefore, the fraction of B 0 (s) → h + h − decays per BDT region, f BDT,i , can be described as where f hh PID,i and f hh trig,i are the relative PID and trigger efficiencies for B 0 The corrected B 0 → K + π − and B 0 s → K + K − distributions are compared with those determined on corrected simulated B 0 → µ + µ − and B 0 s → µ + µ − samples, respectively, as shown in Fig. 7. The BDT distributions of B 0 s → µ + µ − and B 0 s → K + K − decays are compared for the same effective lifetime, namely for A µµ ∆Γs = 1. Because of the good agreement between the two different methods used to calibrate the BDT response, no additional systematic uncertainty is assigned to the BDT distribution. s → µ + µ − and B 0 → µ + µ − decays, the distributions are determined on corrected simulated samples, as described in the text, and are shown in black. The B 0 s → K + K − and B 0 → K + π − distributions are determined using fits to data as described in the text and are shown in red.

Normalisation
The branching fractions of the signal channels are estimated by comparing their yields in data with those of two normalisation channels with well-known branching fractions, B + → J/ψK + and B 0 → K + π − , according to where B, ε and N are the branching fraction, efficiency and yield of the corresponding channel and f sig(norm) indicates the fragmentation fraction of the relevant B meson. Signal candidates having BDT < 0.25 are not included in the fit to the dimuon mass distribution. The parameter α sig is the single-event sensitivity. In the following, the different elements entering Eq. 9 are described. The final single-event sensitivity is obtained for each signal channel as the weighted average of those obtained with the two normalisation channels, taking the correlations between the inputs into account. The branching fractions of the two normalisation channels are taken as B(B + → J/ψK + ) = (6.02 ± 0.17) × 10 −5 [64][65][66], including the J/ψ → µ + µ − branching fraction, and B(B 0 → K + π − ) = (1.96 ± 0.05) × 10 −5 [51,[67][68][69][70][71][72][73][74][75]. The normalisation for the B 0 s → µ + µ − γ decay is calculated only in the region m µ + µ − > 4.9 GeV/c 2 where the branching fraction is measured.

Normalisation channel yields
The yields of the normalisation channels are obtained through unbinned extended maximum-likelihood fits to the mass distributions of the candidates for each data-taking year separately, after the corresponding selection described in Sec. 4. To improve the mass resolution of B + → J/ψK + candidates, the J/ψ mass is constrained to its known value [51,[76][77][78][79]. The mass distributions for selected B + → J/ψK + candidates are shown in Fig. 8 for the different data taking periods. The mass distribution of signal B + → J/ψK + decays is described by a Hypatia function [80] with parameters Gaussian-constrained to the values derived from simulation within their uncertainties, except for the mean and width, which are free to vary in the fit. In addition to signal, the selected candidates contain a contribution of combinatorial background and a small contamination from the B + → J/ψπ + decay, where a pion is misidentified as a kaon. The mass distribution of the combinatorial background is described by an exponential function with the slope left free to vary in the fit. The B + → J/ψπ + decay, which is expected to peak at higher mass values than the B + mass due to assigning the kaon mass to a pion track, is described by an analytical function developed in Ref. [81]. This function is obtained by transforming a Gaussian mass distribution under the pion hypothesis to one under the incorrectly-assigned kaon mass hypothesis, using an analytical description of the candidate kinematics. As an alternative model a non-parametric function tuned on a simulated B + → J/ψπ + sample, where the events were reconstructed under the B + → J/ψK + hypothesis as in Ref. [82], has been used to cross check the results. The fits with the two different descriptions for B + → J/ψπ + decays give compatible results for the B + → J/ψK + yields.
Additional possible background from B + c → J/ψK + K − π + and Λ 0 b → J/ψpK − decays has been investigated and found to be small and evenly distributed in the considered mass range, and hence is considered negligible.
The yield of B 0 → K + π − candidates is determined with a binned maximum-likelihood fit to the data, using the fit model for B 0 → K + π − candidates described in Sec. 5.1. In contrast to the mass calibration, events are required to be triggered independently of the B 0 → K + π − signal, such that the trigger efficiency for this hadronic channel can be determined on B + → J/ψK + decays with the same TISTOS method that is used to determine the signal efficiency. The mass distribution of B 0 → K + π − candidates is shown in Fig. 9 together with the result of the fit, for the different data-taking years. The yields of the two normalisation channels are reported in Table 2, for the different data-taking years and for the two data-taking periods combined.  it is found not to be dependent on pseudorapidity. Here only the integrated values at different energies are used since the average p T of b-hadrons in this analysis is found to be compatible with those used in the determination of the fragmentation fractions [61].   Following the approach adopted in Ref. [61], isospin symmetry is assumed to hold in b quark hadronisation at the LHC such that f u = f d , and hence the same f s /f d values are used in relation to the B + → J/ψK + normalisation channel.

Efficiencies
The efficiencies to detect the signal and normalisation channels can be factorised as  The acceptance, reconstruction and selection efficiencies are evaluated using simulation with corrections applied to improve the agreement with data. The efficiency to detect and reconstruct tracks is evaluated on a sample of J/ψ → µ + µ − decays in data [87], using a tag-and-probe method. These samples are used to determine efficiency correction factors as a function of the particle kinematics, which are convolved with the simulated samples to calculate the total efficiency correction. The corrections are at the level of 1% for all channels and data-taking years. When considering the ratio of signal and normalisation channels in the normalisation formulae (Eq. 9), uncertainties on these corrections are treated as 100% correlated. The total efficiencies for the reconstruction within the LHCb detector and selection are listed for the relevant channels in Table 3, where the efficiency with which the muon system detects muons is included. Correction factors for the imperfect modelling of the muon system efficiency simulation are estimated using a sample of B + → J/ψK + decays in data, which are selected without particle identification criteria [63,88]. These corrections are applied to the signal, B + → J/ψK + and B 0 s → J/ψφ channels, modifying the efficiencies by 1-3%. The efficiency of the PID requirements described in Sec. 4 is measured using highpurity control samples of each particle species obtained from data [63,88]. These control samples are obtained by means of kinematic requirements only, with muons obtained from J/ψ → µ + µ − and B + → J/ψK + decays, pions and kaons from D 0 → K − π + decays selected via D * + → D 0 π + , and protons from Λ → pπ − and Λ + c → pK − π + decays. The muon PID efficiencies are evaluated as a function of the muon momentum and pseudorapidity, as well as the track multiplicity of the event using a dedicated procedure [63]. The resulting efficiency maps are then applied to simulated samples to determine the integrated efficiency for a specific channel. The efficiency measurements for the different hadronic species are described in Sec. 7.1. The results for the signal and normalisation channels are shown in Table 4 and include for the channels with muons the data-simulation correction of the muon system identification. For B + → J/ψK + candidates, only the correction to the muon system identification efficiency is computed, as no further requirements on Table 3: Efficiencies of reconstruction within the LHCb detector and selection for the signal and normalisation channels, averaged for the two running periods. The uncertainties include the statistical uncertainty from the simulated samples and the uncertainty of the tracking efficiency corrections.

Run 1
Run 2 0.0462 ± 0.0007 0.0500 ± 0.0006 B + → J/ψK + 0.0290 ± 0.0003 0.0305 ± 0.0003 Table 4: Particle identification efficiencies for the signal and normalisation channels, averaged for the two running periods, where the first uncertainty is statistical and the second systematic. A data-simulation correction of the muon-system identification is included for channels with muons. For the B + → J/ψK + channel only the data-simulation correction part of the muon identification is reported, as no multivariate PID requirement is applied to this channel. ε PID

Run 1
Run 2 0.8518 ± 0.0007 ± 0.0063 0.8759 ± 0.0004 ± 0.0046 B 0 s → µ + µ − γ 0.8487 ± 0.0006 ± 0.0088 0.8785 ± 0.0003 ± 0.0064 B 0 → K + π − 0.4741 ± 0.0049 ± 0.0010 0.5004 ± 0.0027 ± 0.0012 B + → J/ψK + 1.0096 ± 0.0005 1.00260 ± 0.00018 the multivariate PID classifier are applied when selecting these decays. The systematic uncertainties arise from modelling the dependencies of the PID efficiency maps. The trigger efficiencies are determined from data with the TISTOS method [62]. Trigger information is associated to the reconstructed candidates during the offline processing. The event of a reconstructed signal candidate can be classified into three categories: events triggered on signal (TOS), triggered on part of the underlying event that is independent of the tracks forming the signal candidate (TIS), or triggered on both elements of the signal candidate and the underlying event.
The trigger efficiency can be estimated by exploiting the overlap between the TIS and TOS categories (TIS&TOS) and assuming signal decays uncorrelated with the rest of the event. The trigger efficiency, ε trig , of a given decay channel, with respect to a total of N Tot events, can be computed as where N X is the number of background-subtracted candidates triggered within the category X and the efficiency ε TIS = N TIS&TOS /N TOS is estimated under the already mentioned independence assumption, which is verified in Ref. [62]. The trigger efficiency estimation is done in two steps: the hardware and first softwarelevel trigger efficiencies are estimated from data as described in the following; the second software-level efficiency, being aligned to the offline selection, is estimated from simulation and included in the full trigger efficiency presented in this paragraph. The trigger efficiencies for signal and normalisation channels with muons are calibrated using the B + → J/ψK + channel. In order to reduce residual kinematic correlations between the decay in question and the rest of the event, the calibration is performed in intervals of kinematic quantities. Yields of B + → J/ψK + decays for each trigger category and different kinematic ranges are obtained by performing a mass fit as described in Sec. 6.1. Efficiency tables are obtained as a function of the maximum p T of the two muons and of the product of the p T of the two muons, as these are the variables used in the muon hardware trigger. These efficiency distributions are then convolved with the simulated samples of the relevant channels. The trigger efficiencies for the B 0 → K + π − channel are also obtained in data by measuring the TIS trigger efficiency in Eq. 11 through the more abundant B + → J/ψK + channel, since the TIS efficiencies do not depend on the control channel used to evaluate it.
The trigger efficiencies for the signal and normalisation channels in each data-taking period are presented in Table 5. The systematic uncertainty on the trigger efficiency is comprised of a number of sources. A systematic effect is associated with the choice of the mass model used for the B + → J/ψK + channel. This is estimated by fitting the B + → J/ψK + data using a double Crystal Ball function to model the signal and taking the difference with the default fit as systematic uncertainty. A second systematic uncertainty stems from the precision of the TISTOS method and is obtained by comparing the efficiency determined by applying the TISTOS method to simulation. A third source of systematic effect is due to the difference in the phase space between the B + → J/ψK + decay and the decay channel for which the trigger efficiency is evaluated. The corresponding systematic uncertainty is estimated by comparing the results of the method applied to simulated events using the B + → J/ψK + channel or the considered channel. The last source of systematic effect arises from the choice of the kinematic ranges for which the efficiencies are evaluated, and its uncertainty is determined from the change of the efficiency when these ranges are varied. The resulting shifts in trigger efficiency from each source are added in quadrature, and assigned as the total systematic uncertainty in Table 5.
The efficiencies of the exclusion of the first BDT region on the signal decays are evaluated using the calibrated BDT response described in Sec. 5.2 and are listed in Table 6, combining statistical and systematic uncertainties.

Single-event sensitivities
The single-event sensitivities (defined in Eq. 9) for the three signal channels in Run 1, Run 2 and the full data sample, are reported in Table 7. Single-event sensitivities for BDT > 0.25 are obtained using the two normalisation channels B 0 → K + π − and B + → J/ψK + separately, which are combined into a normalisation using a weighted average, taking into account the relevant correlations. In the same table, the expected number of signal candidates for BDT > 0.25 is reported, assuming the SM branching fraction.

Run 1
Run 2 Table 7: Single-event sensitivities, α(B + ), α(B 0 ) and α(Comb) for the three signal channels obtained for BDT > 0.25 with the two normalisation channels, B 0 → K + π − and B + → J/ψK + , and combined, for Run 1, Run 2 and the full data set. The first uncertainty is statistical and the second systematic. The expected yields assuming SM branching fractions, N exp , are also reported. The B 0 s → µ + µ − γ expected number does not include an uncertainty on the branching fraction. In order to cross-check the normalisation of the signal channels, the ratio of the efficiency-corrected yields of the two normalisation channels This ratio is found to be 0.340 ± 0.016 (stat) and 0.336 ± 0.012 (stat) in Run 1 and Run 2, respectively, in agreement with the ratio of the world averages of these branching fractions, 0.326 ± 0.012 [51].
To cross-check the ratio of the B 0 s and B + fragmentation fractions and its stability over the data taking, the ratio of B + → J/ψK + and B 0 s → J/ψφ efficiency-corrected yields, is also measured, following a similar approach to Ref. [82]. The ratios are found to be similar to those measured in Ref. [82], although the two methods explore different kinematic regions. A dependence on the centre-of-mass energy is seen and found to be consistent with Ref. [82] and the combined analysis of Ref. [61], justifying the use of different f s /f d values for the Run 1 and Run 2 data samples.

Background
Three classes of background events are considered in the analysis: combinatorial background; B 0 (s) → h + h − decays; semileptonic b-hadron decays. The combinatorial background, mainly composed of real muons originating from two different B decays, is modelled using an exponential function with the slope left free to float in the mass fit, as described in Sec. 8. The other background sources are included as separate components in the fit, with mass shapes evaluated on simulated events and with yields that are Gaussian-constrained to their estimated values, as explained in the following sections.

Hadron misidentification rates
To estimate the yield of physical background where one or two final-state particles are misidentified as a muon, it is crucial to perform unbiased measurements of the probability for protons, pions and kaons to pass the muon identification requirements. These measurements are carried out as a function of the track momentum and transverse momentum, using the data control samples listed in Sec. 6.3. The dedicated procedure from Ref. [63] is used for protons, while a different method is developed to determine the pion and kaon misidentification rates, using D 0 → K − π + from D * + → D 0 π + decays. For these particles, especially at low momenta, a sizeable contribution to the misidentification rate originates from hadrons decaying to muons. When the hadron decays in flight, the momentum resolution of the reconstructed track degrades by an amount that depends on the distance the hadron has travelled before decaying and on the fraction of energy inherited by the daughter muon. As a consequence, the mass distribution of the D 0 candidates broadens significantly and the efficiency to select D 0 → K − π + decays in the D 0 mass selection window, 1825 ≤ m Kπ ≤ 1910 MeV/c 2 , decreases. If this effect is not taken into account, a significant underestimation of the misidentified hadron yield would occur. The misidentification efficiencies for pions and kaons are determined by measuring the D 0 yield from a two-dimensional fit to the m Kππ − m Kπ , m Kπ distribution with or without the muon requirement applied to the particle in question. The shape of the signal D 0 mass distribution when the PID selection is applied includes the tail arising from hadron decays-in-flight, estimated from simulated events. The resulting misidentification probability is then corrected for the fraction of D 0 → K − π + decays with the K − π + mass falling outside the D 0 selection window, estimated from simulated events.
The B 0 (s) → h + h − decays can appear as background when both final state hadrons are misidentified as muons. These candidates have a broad mass distribution centred close to the B 0 mass, as determined from simulation where each of the four modes is weighted according to its expected yield.
The expected yield of doubly misidentified B 0 (s) → h + h − events is estimated by normalising to the B 0 → K + π − channel as where ε trig B 0 s →µ + µ − is the signal trigger efficiency, N TIS hh is the number of B 0 (s) → h + h − TIS events evaluated by correcting the B 0 → K + π − TIS yield by the expected fraction of this mode, ε TIS is the TIS efficiency (Sec. 5), and ε hh→µµ represents the double misidentification rate, which is estimated using data control samples (Sec. 7.1) and found to be in the range 10 −6 − 10 −5 , depending on the data set and BDT region. An independent estimate of the B 0 (s) → h + h − background is performed on πµ and Kµ combinations, selected from data samples of B candidates with two tracks in the final states applying strong muon and hadron identification requirements on the tracks. Their mass spectra are fitted and the resulting yields are scaled by the π → µ and K → µ misidentification rates. The ratio between this result and the default estimate is assigned as a correction factor to the misidentification efficiency. The estimated B 0 (s) → h + h − background yields in each BDT region with BDT > 0.25 are summarised in Table 8 for Run 1 and Table 9 for Run 2 data.
For each of the above channels, the number of expected candidates is estimated by normalising to the yield of B + → J/ψK + decays, according to where f x is the hadronisation fraction of the initial-state hadron for the decay mode x, B x is the branching fraction and ε tot x its total selection and trigger efficiency. The efficiencies are estimated from simulation, except for the PID, which is estimated from data control samples as described in Secs. 6.3 and 7.1. The branching fraction, including the hadronisation fraction of the B + c → J/ψµ + ν µ channel, is taken from Ref. [89], while those of B 0(+) → π 0(+) µ + µ − and B 0 → π − µ + ν µ channels are obtained from Refs. [51,90], assuming f u = f d . LHCb measurements for the Λ 0 b → pµ − ν µ and B 0 s → K − µ + ν µ branching fractions [91,92] and hadronisation fractions [61,84] are used. The estimated yields in each BDT region with BDT > 0.25 are shown in Table 8 for Run 1 and Table 9 for Run 2 data.  1.2 ± 0.6 6.01 ± 0.28 Table 9: Expected background yields per BDT region and for BDT > 0.25 with their total estimated uncertainties for Run 2 data.

Measurement of signal branching fractions
The data sample containing B 0 (s) → µ + µ − (γ) candidates is divided into the two datataking periods, Run 1 and Run 2, which are further divided into six subsets based on the BDT response, using the intervals defined in Sec. 4. The branching fractions of the signal decays are determined using an unbinned extended maximum-likelihood fit to the dimuon mass distributions, performed simultaneously on all the subsets. Due to substantial contamination from combinatorial background, the lowest BDT region, 0 ≤ BDT < 0.25, is excluded from the data set but its fraction is taken into account in the total normalisation of the BDT shape. The fit is performed in a mass window of 4900 ≤ m µµ ≤ 6000 MeV/c 2 . The dimuon mass distribution is shown in Fig. 10 for the Run 1 and Run 2 samples in all BDT intervals. The low mass region is populated by the partially reconstructed background and B 0 s → µ + µ − γ decays, while the higher mass region is dominated by combinatorial background.
The probability density functions (PDFs) of the B 0 s → µ + µ − and B 0 → µ + µ − decays are described by DSCB functions, defined in Eq. 4, with their parameters Gaussianconstrained to the values measured in Sec. 5. The mass distribution of the B 0 s → µ + µ − γ decay is described using an empirical threshold function where the parameters a, b and s, are determined from simulation, which is based on the theoretical predictions and form factors of Ref. [18]. The parameter b is found to be close to 0.5, while the other parameters vary across the BDT regions. This threshold function is convolved with a Gaussian resolution function which models the effect of the detector resolution. The parameter values are estimated from kinematically weighted simulated events, and fixed in the fit. While the branching fraction prediction for the B 0 s → µ + µ − γ decay is dependent on the exact form-factor parametrisation used [19], the distribution of the dimuon mass at high q 2 is found to not depend significantly on the choice of the form-factor, and so the same threshold function can be used for a range of scenarios. Moreover, varying the detector resolution parameter within the known uncertainties has a negligible effect on the yield of B 0 s → µ + µ − γ decays. The combinatorial background is modelled with a single exponential function with an independent yield in each BDT region but with common slope parameters for each data-taking period. Both the yields and the parameters are free to float in the fit. The B 0 (s) → h + h − and semileptonic b-hadron contributions are described using kernel estimation techniques [93] applied to simulated events in each BDT region. Their expected yields in each BDT region are Gaussian-constrained according to the values reported in Sec. 7. Moreover, common parameters, such as the yields of the normalisation channels, efficiencies and branching fractions are shared across all BDT regions and their values are Gaussian-constrained to their estimated values and uncertainties.
The result of the fit in each subset is shown in Fig. 10. The resulting branching fractions of the B 0   The statistical uncertainties are evaluated by repeating the fit with all nuisance parameters fixed to the value obtained in the standard fit, where all nuisance parameters are free to float within their constraints. The systematic uncertainties are then computed by subtracting in quadrature the statistical uncertainties from the total ones. The main contribution to the systematic uncertainty of B(B 0 s → µ + µ − ) originates from the knowledge of f s /f d (3%), while that of B(B 0 → µ + µ − ) is dominated by the knowledge of the B 0 (s) → h + h − and semileptonic b-hadron background (9%). The correlation between the B 0 → µ + µ − and B 0 s → µ + µ − branching fractions is found to be −11%. The correlation between the Two-dimensional profile likelihoods are evaluated in the plane of the possible combinations of two branching fractions. These are obtained by taking the ratio of the likelihood value of a fit where the parameters of interest are fixed and the likelihood value of the standard fit. The results are shown in Fig. 11.
The difference between the logarithm of the likelihood values under the presence or the absence of a specific signal component is used to evaluate the significance with Wilks' theorem [94]. The B 0 s → µ + µ − signal exceeds the background-only hypothesis more than 10 standard deviations (σ), while the statistical significance of the B 0 → µ + µ − signal is 1.7 σ and the B 0 s → µ + µ − γ signal is compatible with the background-only hypothesis within 1.5 σ. Since no evidence for B 0 → µ + µ − and B 0 s → µ + µ − γ decays is found, upper limits on their branching fractions are evaluated using the CL s method [95] with a onesided test statistic [96] as implemented in Refs. [97,98]. The one-sided test statistic for a given branching fraction value is defined as twice the negative logarithm of the profile likelihood ratio if it is larger than the measured branching fraction and zero otherwise. Its distribution is determined from pseudoexperiments, where nuisance parameters are set to their best fit values for toy generation while central values of the Gaussian-constraints are independently fluctuated within their uncertainty for each pseudoexperiment as described in Ref. [99]. The CL s curves are shown in Fig. 12 from which the limit on the B 0 → µ + µ − and B 0 s → µ + µ − γ branching fractions are found to be at 90% (95%) CL. The measured upper limits are shown in Fig. 12, together with the expected ones.
To quantify the impact of the B 0 s → µ + µ − γ component on the other signal modes, the fit is repeated by fixing its branching fraction to zero. Using this configuration, B(B 0 s → µ + µ − ) increases by 2% while the limit on B(B 0 → µ + µ − ) decreases by 12%. As described in Sec. 5.2, the BDT calibration of B 0 s → µ + µ − decays depends on the effective lifetime which introduces a model dependence in the measured time-integrated branching fraction. In the fit, the SM value of A µµ ∆Γs = 1 is assumed for B 0 s → µ + µ − and B 0 s → µ + µ − γ decays. The model dependence is evaluated by repeating the fit under the assumptions A µµ ∆Γs = 0 and −1, finding an increase of the B 0 s → µ + µ − branching fraction with respect to the SM hypothesis of 4.7% and 10.9%, respectively. On the contrary, the B 0 s → µ + µ − γ branching fraction decreases with respect to the A µµ ∆Γs = 1 hypothesis by 2% and 5% with a neglible impact on its limit. The dependence for B 0 s → µ + µ − and B 0 s → µ + µ − γ is approximately linear in the physically allowed A µµ ∆Γs range. To evaluate    the ratio of the branching fractions R µ + µ − defined in Eq. 3, the fit is modified such that R µ + µ − and B(B 0 s → µ + µ − ) are floating observables, which allows for the cancellation of common uncertainties, while B(B 0 s → µ + µ − γ) is kept as a floating observable. The ratio is found to be where the first uncertainty is statistical and the second systematic. Using the CL s method described above, the upper limit is evaluated to be The effective lifetime of the B 0 s → µ + µ − decay is measured using the same data sample as for the branching fraction measurement but with a slightly different selection, described in Sec. 4. The data are divided into two regions of the BDT classifier response and unbinned extended maximum-likelihood fits are performed to the dimuon mass distribution in each region in order to calculate weights using the sPlot method [100]. These weights are then used to extract the B 0 s → µ + µ − signal decay-time distributions. Finally, the effective lifetime is measured using an unbinned maximum-likelihood fit to the weighted decay-time distributions, performed simultaneously to both BDT regions.
The fits to the dimuon mass used to extract the weights are performed in the range 5320 ≤ m µµ ≤ 6000 MeV/c 2 . The lower limit of 5320 MeV/c 2 removes the low mass region containing most of the physical background, including B 0 → µ + µ − and B 0 (s) → h + h − decays, so only B 0 s → µ + µ − decay and combinatorial background components are included in the fit. Residual contamination from physical background in the fit region is low and is treated as a source of systematic uncertainty. The B 0 s → µ + µ − signal is modelled using a DSCB PDF and the background with an exponential PDF. The parameters of the signal PDF are determined using the same method as for the branching fraction (Sec. 8) and are fixed in the fit, while the signal and background yields and the decay constant of the combinatorial background exponential are allowed to float freely. The distributions of the dimuon mass in the two BDT regions are shown in Fig. 13. To make an unbiased measurement of the effective lifetime, the decay-time dependence of the combined trigger, reconstruction and selection efficiency must be accurately estimated. This decay-time acceptance is calculated using simulated B 0 s → µ + µ − candidates, which have been weighted in order to improve agreement with data. The parameters of the functions used to model the decay-time efficiency are extracted using unbinned maximum-likelihood fits to the decay-time distributions of simulated B 0 s → µ + µ − candidates, with the effective lifetime fixed to its true value. Since the decay-time efficiency has a different form in the two BDT regions, two different empirical functions are used. In the low BDT region the efficiency is modelled as where Erf is the error function, t is the reconstructed decay time, a, b, c, d and e are free parameters and ε(t) = 0 when t < 0.26 ps. The acceptance in the high region is modelled using where f, g and t 0 are free parameters and ε(t) = 0 when t ≤ t 0 . The forms of these functions with respect to the B 0 s meson decay time are shown in Fig. 14. The different behaviour in the two intervals reflect the positive correlation of the B 0 s -meson decay time with the BDT response, so that the low (high) BDT region contains more signal decays with small (large) decay times.
Finally, the B 0 s → µ + µ − effective lifetime is determined using a simultaneous fit to the background-subtracted decay-time distributions in the two BDT regions in data, where the decay-time distributions are modelled by the acceptance functions above multiplied by an exponential function. Only the effective lifetime is allowed to float freely in the fit, while the parameters of the acceptance function are Gaussian constrained to the results of the fits to simulation.
Pseudoexperiments are used to evaluate several systematic effects that have the potential to bias the measurement. The fit procedure is found to return an unbiased estimate of the lifetime to a precision of 0.009 ps and good coverage. The effects of residual contamination from physical background, predominantly B 0 (s) → h + h − and Λ 0 b → pµ − ν µ decays, is found to introduce a bias of around 0.012 ps. The effect of the decay-time acceptance on the mixture of the light and heavy mass eigenstates is evaluated and found to be negligible [32]. A further source of uncertainty is related to the decay-time distribution of the combinatorial background, which is unknown a priori and can bias the lifetime measurement if the background lifetime is significantly longer than that of the signal. The decay-time distribution of combinatorial background in the dimuon sample cannot be determined directly from data due to the very small number of candidates and so instead the decay-time model used in the pseudoexperiments is taken from a fit to the decay-time distributions of candidates in the high mass region of the higher yield di-hadron sample, which includes both a short and long lived component. A systematic uncertainty is evaluated by fluctuating the mean lifetimes of both components upwards by 1 σ, which results in an overall bias on the B 0 s → µ + µ − effective lifetime of around 0.003 ps. The effect of a production asymmetry between B 0 s and B 0 s mesons [101] is found to be small, at 0.002 ps. The effects of ignoring the detector decay-time resolution and also the choice of signal mass PDF are evaluated and are both found to be negligible. Any correlation between mass and decay-time is found to be negligible, as required by the sPlot method.
The entire procedure used to measure the lifetime is cross-checked by performing measurements of the lifetimes of the B 0 → K + π − and B 0 s → K + K − decays, which have much larger branching fractions and have already been precisely measured. Candidates are selected using similar requirements to those used to select B 0 s → µ + µ − decays, with a few differences. While the efficiency of the B 0 s → µ + µ − trigger selection is independent of decay time, this is not true for B 0 (s) → h + h − decays where flight distance requirements are imposed on candidates. In order to match the B 0 s → µ + µ − trigger requirements as closely as possible, B 0 (s) → h + h − events are selected with TIS requirements, eliminating any dependence of the trigger efficiency on decay time. Furthermore, different requirements on particle identification information are imposed in order to separate the K + π − and K + K − final states. The decay-time acceptance is evaluated using weighted simulation in the same way as for B 0 s → µ + µ − and the fit procedure is identical apart from some differences in the fit ranges and the inclusion of a B 0 s → K − π + component in the B 0 → K + π − fit. The fits to the B 0 s → K + K − and B 0 → K + π − mass and weighted decay-time distributions are 5100 5200 5300 5400 5500 ] where the uncertainties are statistical only. Systematic uncertainties on these crosscheck measurements are not evaluated. The results are in agreement with the values measured previously by the LHCb collaboration of τ B 0 s →K + K − = 1.407 ± 0.016 ps and τ B 0 →K + π − = 1.524 ± 0.011 ps [102]. The measurements presented here are performed on data samples with very little overlap with those used to make the measurements published in Ref. [102] due to different data-taking periods and trigger requirements. The statistical uncertainty on the measured B 0 s → K + K − lifetime is taken as the systematic uncertainty associated with the use of simulated events to determine the B 0 s → µ + µ − acceptance function.
A summary of the systematic uncertainties is reported in Table 10. Finally, the effective B 0 s → µ + µ − lifetime is measured with the fits to the decay-time distributions shown in Fig. 17, as τ µ + µ − = 2.07 ± 0.29 ± 0.03 ps,  where the first uncertainty is statistical and the second systematic. While this value lies above the physical range defined by the lifetimes of the light (A ∆Γ = −1) and heavy (A ∆Γ = 1, predicted by the SM) mass eigenstates, which are τ L = 1.423 ± 0.005 ps and τ H = 1.620±0.007 ps [51], it is consistent with these lifetimes at 2.2σ and 1.5σ, respectively.

Conclusions
In summary, the full Run 1 and Run 2 data sample of the LHCb experiment was analysed to measure the B 0 s → µ + µ − branching fraction and effective lifetime and to search for the B 0 → µ + µ − and B 0 s → µ + µ − γ decays. The branching fractions of the B 0 s → µ + µ − , B 0 → µ + µ − and B 0 s → µ + µ − γ decays are measured to be where the first uncertainty is statistical and the second systematic. The systematic uncertainty on the B 0 s → µ + µ − decay is significantly reduced compared to previous measurements thanks to a new precise value of the hadronisation fraction ratio and a more precise calibration of the BDT response and of the particle misidentification rate. The B 0 s → µ + µ − branching fraction is the most precise single-experiment measurement to date.
The B 0 → µ + µ − and B 0 s → µ + µ − γ signals are not statistically significant, and consistent with the background-only hypothesis at 1.7 and 1.5 σ level, respectively. Therefore, upper limits on the branching fractions are set to B(B 0 → µ + µ − ) < 2.6 × 10 −10 B(B 0 s → µ + µ − γ) < 2.0 × 10 −9 at 95% CL, the latter with m µµ > 4.9 GeV/c 2 . The limit on the B 0 → µ + µ − decay is the most stringent to date from a single experiment. An upper limit on the B 0 s → µ + µ − γ branching fraction is determined for the first time. This limit only constrains the high-q 2 region of this decay and no attempt is made here to extrapolate the result to the full branching fraction.
Using the same data sample, with a slightly different selection, the effective lifetime of the B 0 s → µ + µ − decay is found to be τ µ + µ − = 2.07 ± 0.29 ± 0.03 ps, where the first uncertainty is statistical and the second systematic. All the results are compatible with the predictions of the SM and with previous measurements of these quantities. In particular, in the two-dimensional space of B 0 s → µ + µ − and B 0 → µ + µ − branching fractions the compatibility is at one standard deviation level. These results significantly constrain possible contributions to these decays from new interactions that would cause effective scalar, pseudoscalar or axial-vector currents, and thus limit the parameter space of new physics models. [