Measuring source properties and quasi-normal-mode frequencies of heavy massive black-hole binaries with LISA

The laser-interferometer space antenna (LISA) will be launched in the mid 2030s. It promises to observe the coalescence of massive black-hole (BH) binaries with signal-to-noise ratios (SNRs) reaching thousands. Crucially, it will detect some of these binaries with high SNR both in the inspiral and the merger-ringdown stages. Such signals are ideal for tests of General Relativity (GR) using information from the whole waveform. Here, we consider astrophysically motivated binary systems at the high-mass end of the population observable by LISA, and simulate their signals using the newly developed multipolar effective-one-body model: \texttt{pSEOBNRv5HM}. The merger-ringdown signal in this model depends on the binary properties, and also on parameters that describe fractional deviations from the GR quasi-normal-mode (complex) frequencies of the remnant BH. Performing full Bayesian analyses, we assess to which accuracy LISA will be able to constrain deviations from GR in the ringdown signal when using information from the whole signal. We find that these deviations can typically be constrained to within $10\%$ and in the best cases to within $1\%$. We also show that with this model we can measure the binary masses and spins with great accuracy even for very massive BH systems with low SNR in the inspiral. We also probe the accuracy of the \texttt{SEOBNRv5HM} waveform family by performing synthetic injections of GR numerical-relativity waveforms. Using a novel method that we develop here to quantify the impact of systematic errors, we show that already for sources with SNR $\mathcal{O}(100)$, we would measure erroneous deviations from GR due to waveform model inaccuracies. One of the main sources of error is the mismodelling of the relative alignment between harmonics. Our results confirm the need for improving waveform models to perform tests of GR with high SNR binary BHs observed by LISA.


I. INTRODUCTION
We are now well into the era of gravitational-wave (GW) astronomy, with 90 observations of compact-object binaries [1,2] by the LIGO-Virgo-KAGRA (LVK) Collaboration [3][4][5] and other claimed detections [6,7].The fourth observing run of the LVK Collaboration has just started, with the promise of many new detections thanks to improved sensitivity [8].In addition to unveiling an otherwise hardly detectable population of binary black holes (BBHs) [9][10][11][12][13], constraining the equation of state of neutron stars [14,15] and inferring astrophysical and cosmological information [16,17], GWs allow us to test general relativity (GR) [18][19][20][21] in the stronggravity and high-velocity regime, which is not accessible to other experiments.Indeed, by comparing predictions for the GW signal of a BBH within GR to the observed data we can constrain deviations from GR.
One of the most promising approaches to probe deviations from GR with GWs are the so-called "ringdown tests".In the last stage of the coalescence of a BBH, after the two BHs have merged, the remnant BH is in a perturbed state and relaxes to a steady state configuration through GW emission.This stage is called the ringdown.In this final stage, the signal is a superposition of damped sinusoids with frequencies and damping times that depend exclusively on the properties of the remnant [22][23][24][25][26][27].Within GR, the "no-hair" conjecture [28] tells us that those are the mass and the spin of the final BH, since astrophysical BHs are expected to carry no electric charge.Some gravity theories predict additional "hairs" for BHs, due, for instance, to cosmological boundary conditions or the presence of nearby matter [29][30][31] or to additional fields [32][33][34][35][36][37][38][39].In any case, the exact relation between the properties of the remnant and the spectrum of quasinormal modes (QNMs) (i.e., the sets of frequencies and damping times) is theory dependent [40][41][42][43][44][45][46][47][48][49][50][51][52][53][54].By measuring two or more QNMs we can test if the signal agrees with GR.This is the basic idea behind BH spectroscopy [55,56].On the other hand, the amount to which each mode is excited (i.e., its amplitude) and the relative phases between them do depend on the properties of the BHs in the binary and the binary dynamics [57][58][59][60].Therefore, a consistent modelling of the merger-ringdown together with the inspiral can improve our ability to measure the QNMs, and to constrain deviations from GR during the ringdown.This is the approach followed in Refs.[61,62], where the authors developed a parametrised model of the ringdown signal as part of the full inspiral-merger-ringdown (IMR) waveforms [63,64] in the effective one-body (EOB) formalism [65,66].Such a model can be used to perform parametrised (or theory-agnostic) ringdown tests of GR by allowing the QNMs to deviate from their GR prediction: A departure from the Kerr spectrum would be indicative of non-GR effects.The model has also been extended to parametrise the plunge-merger stages in Ref. [67], and to carry out theoryspecific tests of GR in the ringdown in Ref. [68].Here, we employ the parametrised ringdown test, which has already been applied to analyse the GW signals observed by the LVK Collaboration, showing so far consistency with GR [20,21,61,62,67].The precision of the test has so far been limited by the low signal-to-noise ratio (SNR) of the sources, which has led to measurement errors for the frequency and arXiv:2307.15086v3[gr-qc] 20 May 2024 decay time of the dominant QNM on the order of 10% and 20%, respectively, when combining events in a hierarchical way [21].More specifically, for LIGO-Virgo observations this test can be applied only when both the pre-and postinspiral regimes have at least SNR ∼ 8, which has been the case for 12 binary systems [21].The best single-event measurement [20] has been obtained with GW150914, which has a total SNR of 24 [69].Other approaches have also been developed to do BH spectroscopy with LIGO-Virgo data [20,21], using a superposition of damped sinusoids [70,71], in some cases augmented with QNM amplitudes calibrated to numerical-relativity (NR) simulations.
Scheduled for launch in the mid-2030s, the Laser Interferometer Space Antenna (LISA) [72] will detect massive BH binaries (MBHBs) with SNRs reaching thousands, sometimes both in the inspiral and in the merger-ringdown [73][74][75].MB-HBs are therefore promising candidates for performing ringdown tests that use information from the full signal.Previous studies on ringdown analysis with LISA [58,74,76] focused on "pure" ringdown tests (i.e., using a superposition of damped sinusoids after the merger) and employed simplified methods and criteria such as the Fisher matrix formalism [77,78] to estimate the measurement accuracy and the distinguishability between QNMs.In this paper, we simulate LISA observations of MBHBs and run full Bayesian analyses on those in order to assess to which accuracy putative deviations from GR in the ringdown could be constrained from such observations when using information from the whole signal.Furthermore, we make use of parametrised EOB waveforms developed using the state-of-the-art multipolar aligned-spin model SEOBNRv5HM1 developed in Refs.[79][80][81][82].Henceforth, we denote the parametrised model as pSEOBNRv5HM.Crucially, this model includes higher harmonics, which are expected to play an important role in LISA parameter estimation [83].
Since the expected population of MBHBs is highly uncertain [73,[84][85][86][87][88][89], here we focus on a few astrophysically realistic systems, compatible with the predictions of models where MBHs form from heavy seeds [90].For simplicity, in this work we neglect the effect of spin precession and eccentricity.Our study is performed by using the same waveform model for generating mock injections and for estimating the parameters of the source.However, it is crucial to assess if such tests of GR could be spoiled by the limited accuracy of our theoretical models when performed on real data.Therefore, we assess the impact of systematics in waveform modelling on ringdown tests by simulating mock LISA injections with waveforms from NR and using pSEOBNRv5HM waveforms to perform the Bayesian analysis.Developing a new approach, we estimate from which SNR we expect to erroneously measure deviations from GR due to systematic effects.Finally, we assess to which extent a consistent modelling of the full signal allows us to measure the binary parameters in GR also for systems that are merger-ringdown dominated and have low SNR in the inspiral.
This paper is organised as follows.In Sec.II we present the details of our parametrised EOB model, describe how the synthetic LISA observations are generated, and lay down the basis of our Bayesian analyses.In Sec.III we summarise the astrophysical systems that we simulate.We present our results when using the pSEOBNRv5HM model for both injection and parameter estimation in Sec.IV; then in Sec.V we discuss the impact of systematics and using a novel method, we show that our models are not yet accurate enough for the signals we expect with LISA.Finally, we present our conclusions in Sec.VI.In the appendices, we discuss how the settings of pSEOBNRv5HM waveforms impact the parameter estimation, and we show that, in GR, SEOBNRv5HM and IMRPhenomTHM [91], a time-domain waveform model from the IMR phenomenological family [92][93][94], predict similar measurement errors for the parameters of the source, that their measurement is little affected by adding the QNM deviation parameters in the pSEOBNRv5HM model, and show additional results of our study of systematic effects.Throughout this paper we will use natural units in which c = G = 1.

A. Parametrised waveform model
We consider a binary with BH component (detector-frame) masses m 1 and m 2 and define the mass ratio q = m 1 /m 2 ≥ 1 and the total mass M t = m 1 + m 2 .We limit to BHs moving on quasicircular orbits with aligned or anti-aligned spins (aligned spins for short) and define the (dimensionless) spin variables χ 1 = S 1 /m 2 1 and χ 2 = S 2 /m 2 2 , which range between −1 and 1.We denote the luminosity distance of the source as D L and the cosmological redshift as z.We adopt the cosmology determined by the Planck mission (2018) [95].Masses, times, and frequencies are in the detector frame, unless they carry a subscript s.Source-frame masses m 1,s and m 2,s are related to the detector-frame ones by The GW polarisations can be expanded in the basis of spinweight −2 spherical harmonics as ) where the parameters (ι, φ 0 ) denote the binary's inclination angle with respect to the direction perpendicular to the orbital plane and the azimuthal direction to the observer, respectively, and Θ denotes the intrinsic parameters (masses and spins) of the binary.We build our parametrised model using the SEOBNRv5HM model [81] in GR, which includes several higher harmonics, notably the (ℓ, |m|) = (2, 1), (3,3), (3,2), (4,4), (4,3) and (5,5) harmonics, in addition to the dominant (2, 2) harmonic.For aligned-spin binaries, h ℓm = (−1) ℓ h * ℓ−m , therefore, we focus on (ℓ, m) harmonics with m > 0.
In the EOB framework [66], the GW harmonics are decomposed as where θ(t) is the Heaviside step function, h insp−plunge ℓm corresponds to the inspiral-plunge part of the waveform, while h merger−RD ℓm represents the merger-ringdown waveform.In particular, as explained in Ref. [81], t ℓm match is chosen to be the peak of the (2, 2) harmonic amplitude for all (ℓ, m) harmonics except (5,5), for which it is taken as the peak of the (2, 2) harmonic minus 10M t .In the following, we suppress the Θ dependence for ease of notation.
For all harmonics, except for (ℓ, |m|) = (3, 2) and (4, 3) which exhibit postmerger oscillations due to mode mixing [96,97], the merger-ringdown waveform employs the following ansatz [64,81,98], where ν = m 1 m 2 /(m 1 +m 2 ) 2 is the symmetric mass ratio of the binary and σ ℓm0 is the complex frequency of the least-damped QNM, having overtone number zero, of the remnant BH.We define the corresponding oscillation frequency f ℓm0 and the damping time τ ℓm0 respectively as (2.4b) The functions Ãℓm (t) and φℓm (t) are given by [64,81,[98][99][100]: where ϕ ℓm match is the phase of the inspiral-plunge harmonic (ℓ, m) at t = t ℓm match .The coefficients d ℓm 1,c and c ℓm i,c (i = 1, 2) are constrained by the requirement that the amplitude and phase of h ℓm (t) are continuously differentiable (C 1 ) at t = t ℓm match .This allows us to write the coefficients c ℓm i,c as [64, 81] where ω insp-plunge ℓm (t) is the frequency of the inspiral-plunge EOB harmonic.The coefficients c ℓm i, f and d ℓm i, f are obtained through fits to a large set of NR waveforms (∼ 440), spanning mass ratios up to 20 and spins up to 0.998, and BH perturbation-theory merger-ringdown waveforms for mass ratio 1000.Crucially, the fits depend on the binary's masses and spins Θ and can be found in Appendix D in Ref. [81].
As an example, we illustrate in Fig. 1 how the GW amplitude and frequency of the (2, 2) harmonic changes, during the late inspiral, merger and ringdown, as the component spins are varied, for a binary with mass ratio 2 and equal spins.
For the (3, 2) and (4, 3) harmonics, the mode-mixing behaviour is modelled by applying the previous construction to the spheroidal harmonics [101] (3, 2, 0) and (4, 3, 0), which feature a monotonic amplitude and frequency evolution [102].The spheroidal (3, 2, 0) and (4, 3, 0) harmonics can be related to the spherical harmonics by [ Though overtones are not explicitly included in the mergerringdown signal ansatz, unlike older versions of the model [63,104], their effect should be captured by the functions Ãℓm (t) and φℓm (t).They contain free coefficients fitted against NR simulations and allow our ansatz to be more than a simple FIG. 2. Amplitude in frequency domain in the TDI channel A broken into the contributions of each of the harmonics included in the pSEOBNRv5HM model, Ãℓm .The latter is computed using the harmonic decomposition of Eq. (2.1) in the TDI equations (2.10a).The thin black line shows the LISA PSD and the black dashed lines the GW frequency at t c (i.e., the separation between the inspiral and mergerringdown regimes).We recall that t c is defined from the peak of h 22 , and that the computation of the TDI variables involves second derivatives of the GW polarisation, leading to an offset between the maximum of h 22 and those of A and E.Moreover, these quantities are defined in time-domain, whereas we show here the frequency-domain amplitudes, and the time-domain peak typically corresponds to lower frequencies than the frequency-domain one.We plot the amplitudes only for M t,0 = 2 × 10 7 M ⊙ and z 0 = 2.2.For systems at z 0 = 3.7, one should simply rescale the amplitudes by 18, 331/33, 691 ≃ 0.54, and for systems with M t,0 = 2 × 10 8 M ⊙ , one should multiply the amplitudes by 10 2 and the frequencies by 10 −1 .As expected, the (2, 2) harmonic is the loudest, followed by the (3, 3), (4, 4) and (5, 5) harmonics.For high-spins systems (upper row), the (3, 2) and (4, 3) harmonics are louder than the (2, 1) harmonic, whereas the opposite is true for low-spin systems (lower row).
damped sinusoid with damping time and frequency given by the fundamental QNM.Moreover, we do not expect the linear perturbation description used in the ringdown to be valid right at the time used to transition from inspiral to mergerringdown [105], which for most modes is the peak of the (2, 2) harmonic.Thus, the choice of modelling in SEOBv5NRHM allows one to better capture such nonlinearities without having to include a transition phase.
In the SEOBNRv5HM model constructed in Ref. [81], the complex QNM frequencies in GR are obtained for each (ℓ, m) harmonic as a function of the BH's final mass and spin using the qnm Python package [106].The BH's mass and spin are, in turn, computed using the fitting formulas of Refs.[107,108] respectively.In this work, following the strategy of Refs.[61,62,67,71,[109][110][111], we introduce parametrised fractional deviations to the QNM frequencies, which are free parameters of the model (see Ref. [68] where the deviations were mapped to specific gravity theories alternative to GR).More explicitly, we perform the substitutions where for ease of notation we have dropped the zero overtone subscript in the deviation parameters.We shall denote this parametrised model as pSEOBNRv5HM.We note that allowing σ ℓm0 to vary freely also modifies the c ℓm i,c and d ℓm 1,c coefficients in Eqs.(2.6a), (2.6b), and (2.7), which enter the amplitude and phase functions Ãℓm (t) and φℓm (t).As a consequence, such a modification can lead to deviations from the GR prediction in the ringdown signal starting soon after the merger.The plunge-merger stage of the waveform could be, in principle, also modified, as done, for example, in Ref. [67], by introducing deviations with respect to the GR predictions to the time at which the amplitude peaks and to the value of the amplitude and frequency at this instant, for each waveform harmonic.
Finally, the inspiral-plunge EOB waveforms (2.2) are computed based on the two-body dynamics that are computed by solving Hamilton's equations with a suitable EOB Hamiltonian and radiation-reaction force (see Refs. [80,81] for details).

B. Generation of LISA signals
We use the long-wavelength approximation [112] to compute the response of LISA to an incoming GW, which is valid when the GW wavelength is much larger than the LISA arm length L (i.e., in terms of the GW frequency, when 2π f L/c ≪ 1).Given that L = 2.5 × 10 8 m, this condition is satisfied for sources reaching maximum frequencies of ∼ 10 −3 Hz, such as the MBHBs we consider in this work.Under this approximation, LISA is somewhat similar to two LIGO or Virgo-type detectors rotated with respect to each other by π/4 and with angles of π/3 between the arms.
Transforming Eq. ( 47) in Ref. [113] to the time domain, we find that under the long-wavelength approximation the time-delay-interferometry (TDI) variables A, E, and T [114] (which, for an interferometer with equal arms and equal noise levels in each optical link, provide three noise-uncorrelated datasets) are given by where F + and F × are the antenna pattern functions: F ×,0 (λ, β) = sin 2 β sin(2λ − π/3). (2.11d) In the above equations, λ, β, and ψ are the longitude, latitude and polarisation in the LISA frame, respectively.We refer to Ref. [113] for the relation between the angles in the LISA frame and those in the solar-system-barycentre frame.As anticipated, this is similar to the response of ground-based detectors.The main difference is that it is the second derivatives of the waveform polarisations that enter Eq. (2.10a).This comes as a consequence of taking waveform differences in order to perform TDI, with a time step that goes to zero in the longwavelength approximation.Finally, because most of the SNR of the signals we consider is accumulated in the last stages of the evolution (i.e., from a few hours to a few days), we do not take into account the motion of LISA about the Sun.Therefore, λ, β, and ψ in Eq. ((2.10a)) are not varying with time.Henceforth, we will use Ã and Ẽ to denote the Fourier transform of A and E.
We generate EOB waveforms from the frequency f gen = 5 × 10 −5 [M t,0 /(2 × 10 7 M ⊙ )] Hz until the end of the signal.After transforming to the frequency domain, we keep the portion of the signal between f min = 2 × 10 −4 [M t,0 /(2 × 10 7 M ⊙ )] Hz and f max , to eliminate spurious features due to Fourier transform.The maximum frequency is chosen such that the frequency-domain amplitude is 1% of its maximum value.We verified that our choice of f min leads to a loss in SNR of less than 2%.
The time and phase alignment of the signals is done in the following way.We define the time to coalescence t c as the moment the amplitude of the (2, 2) harmonic reaches its peak and define the phase of coalescence φ c as the phase of the (2, 2) harmonic contribution to the total waveform at t c .In practice, the last step is done by choosing the azimuthal angle φ 0 such that the phase of −2 Y 22 (ι, φ 0 ) h 22 (t c ) is φ c .This choice of φ 0 is then propagated consistently to other harmonics.We use t c to split between the inspiral and merger-ringdown regimes.We note that t c for EOB waveforms coincides with t ℓm match in Eq. (2.2), for all (ℓ, m) harmonics except (5,5).

C. Bayesian analysis
We define the noise-weighted inner product between two data streams d 1 and d 2 as where S n ( f ) is the power spectral density (PSD).In this work, we use the SciRDv1 noise curve [115], which corresponds to the scientific requirement for the LISA mission and defines pessimistic noise levels compared to current predictions.For a given choice of the PSD, the SNR of a signal h is defined as SNR = √ (h|h).To quantify the precision with which LISA observations will estimate the parameters of a source, we work in a Bayesian framework and compute the posterior distribution on the source parameters, θ, given an observed dataset d, using Bayes' theorem: where p(d|θ) is the likelihood, p(θ) is the prior and p(d) is the evidence.As long as we are not interested in model selection, the latter acts as a normalisation constant, and, thus, can be discarded.We take the prior to be flat in the (detector-frame) total mass M t , the mass ratio q, the spins −1 ≤ χ 1 ≤ 1 and −1 ≤ χ 2 ≤ 1, the time to coalescence t c , and the phase at coalescence φ c .For the systems we consider here, the intrinsic parameters M t , q, χ 1 , and χ 2 are typically well measured, so that the actual priors have little importance.We take a flat prior on ψ, cos(ι), and log 10 (D L ) and fix the sky location (λ, β) to its true value to facilitate the convergence of the chains.Those parameters are not expected to correlate strongly with intrinsic parameters [113], at least for aligned-spin binaries, and so this simplification should not significantly affect our conclusions.Finally, we take a flat prior between −1 and 1 for the fractional deviations to the QNMs, δ f ℓm and δτ ℓm .Assum- ing noise to be stationary and Gaussian, the likelihood reads The posterior distribution is then sampled via a Markov-chain Monte Carlo algorithm (MCMC).We use the Eryn sampler [116,117] for this purpose.

III. ASTROPHYSICAL BINARY SYSTEMS
We consider a set of 16 binary systems, defined from all possible combinations of the following choices of parameters: • q 0 = 2 or q 0 = 4, • χ 1,0 = χ 2,0 = 0.9 or χ 1,0 = 0.2, χ 2,0 = 0.1, and Subscripts 0 indicate the true value of the parameter used to generate the synthetic injections.This set of systems lies in the high-mass end of predictions for the population visible to LISA, as predicted from semianalytic models of MBH populations that use heavy seeds for the MBH progenitors [73,[84][85][86][87][88][89].Different heavy-seed scenarios have been proposed, such as the collapse of protogalactic disks as a result of bar instabilities [118], the runaway collision of stars at the centre of galaxies [119], or the direct collapse of gas at the centre of galaxies [120] (see Ref. [90] for a review).Such heavy systems are the ones expected to have higher SNR in the merger-ringdown [74,75], and are, therefore, the most relevant ones to our analysis.In particular, very heavy systems (M t,0 = 2 × 10 8 M ⊙ ) are expected to have very little SNR in the inspiral, and it is interesting to assess how well the parameters of such systems can be measured.Focusing on such massive systems is even more well motivated following the latest results from pulsar-timing-array observations [121][122][123][124].If the apparent signal in the pulsar-timing-array data is generated by massive black hole inspirals, it indicates that MBHs might be more massive than originally expected.Semianalytical models predict a wide range of values for the mass ratio, but the vast majority of systems are predicted to have comparable masses.This is also the domain where our current IMR models are the most reliable.The spin of a MBH typically depends on the amount of gas surrounding it and on how it has acquired mass and angular momentum through accretion [125].We consider two possibilities in order to cover both the case where MBHs are efficiently spun up and the case they are not.The relative alignment of the spins in a MBHB also depends on the presence of gas around the binary.Mergers happening in a gas-rich environment tend to have aligned spins due to the Bardeen-Peterson effect [126,127].Binaries formed through triplet interactions can also have misaligned spins, in addition to having high eccentricity [86].As discussed, we neglect eccentricity and spin precession here for simplicity and focus on quasicircular binaries.Exploring different values for q, χ 1 , and χ 2 is interesting, because they affect how much higher harmonics are excited and, therefore, how well it is possible to constrain the QNMs other than the fundamental one.Finally, the redshifts at which MBHBs coalesce depend on when MBH seeds form and on the different timescales at play during the hardening of the binary [128][129][130].In heavy-seed models, MBHBs are expected to merge dominantly at late times (i.e., low redshift).Since M t is the detector-frame total mass, changing the redshift affects only the SNR of the system.Based on the predictions of semianalytical models [73,[84][85][86][87][88][89], we expect to observe up to a few tens of systems similar to the ones we defined during the nominal mission duration (four and a half years).For all binary systems we take ι 0 = π/3, ψ 0 = π/3, λ 0 = π/3, β 0 = π/3, and φ c,0 = 0.
In Fig. 2, we plot the frequency-domain amplitude of the TDI variable A for the systems with M t,0 = 2 × 10 7 M ⊙ and z 0 = 2.2 and the four combinations of mass ratio and spins.We show the contribution of each harmonic, Ãℓm , using Eq.(2.1) when computing the TDI variables [see Eq (2.10a)].The black dashed lines indicate the GW frequency at t c , which we choose as the separation between the inspiral and mergerringdown regimes.As expected, the (2, 2) harmonic is the loudest, but higher harmonics are also important, in particular, the (3, 3), (4, 4) and (5,5).This is better quantified in Fig. 3, where we show the total SNR and the contribution of each harmonic.We observe that the relative importance of the subdominant (2, 1), (3, 2) and (4, 3) harmonics depends primarily on the spins: (2, 1) is more dominant for low-spin systems, whereas (3, 2) and (4, 3) are more dominant for highspin systems.We note the very high SNR of some of these systems, reaching a few thousands.Figure 3 also shows the inspiral SNR, defined by using the GW frequency at t c as the upper limit in the integral in Eq. (2.12).As anticipated, the signals of the systems we consider are merger-ringdown dominated.In Table I, we give the IMR, merger-ringdown, and inspiral SNR of the systems at z 0 = 2.2.Although systems with M t,0 = 2 × 10 7 M ⊙ (upper panel) still have high SNR also in the inspiral, ∼ 1000, it is not the case for the very massive systems (M t,0 = 2 × 10 8 M ⊙ , lower panel), with inspiral SNRs as low as ∼ 30.

IV. MEASURING SOURCE PROPERTIES AND QNMS WITH MBHB OBSERVATIONS
We work with zero-noise injections, as these are well suited to the goals of understanding systematics and measurement uncertainties [131], and perform three types of analyses using the SEOBNRv5HM and pSEOBNRv5HM models as syntheticsignal injections and templates.
1. We inject a synthetic signal without deviations from GR (i.e., SEOBNRv5HM) and use templates in the Bayesian analyses not allowing for deviations from GR (i.e., SEOBNRv5HM); 2. we inject a synthetic signal without deviations from GR (i.e., SEOBNRv5HM) and use templates in the Bayesian analyses allowing for deviations from GR (i.e., pSEOBNRv5HM); 3. we inject a synthetic signal with deviations from GR (i.e., pSEOBNRv5HM) and use templates in the Bayesian analyses allowing for deviations from GR (i.e., pSEOBNRv5HM).
The first type of analysis will estimate how well the parameters of MBHBs can be constrained assuming GR is correct.It is the first study of this kind using EOB waveforms.The second will tell us how well the deviation parameters of QNMs can be constrained, and the third for which values of the deviation parameters we can detect non-GR effects in the ringdown.We perform these mock injections for all MBHBs described in Sec.III.

A. Measurement of source parameters in GR
We show in Fig. 4 the width of the 90% confidence interval centred around the median for the intrinsic parameters (i.e., the masses and spins) as a function of the SNR of the system.The colour, shape, and filling of the point indicate, respectively, the total mass, spin and mass ratio of the system, indicated in the legends.Each point is doubled because of the two redshifts used: z 0 = 2.2 (z 0 = 3.7) corresponds to the FIG. 4. Measurement error (90% confidence interval centred around the median) of intrinsic parameters.For mass parameters, we show the errors relative to the injection value.We inject signals within GR and recover them with GR templates (i.e., we employ SEOBNRv5HM).All points are doubled because of the two redshift possibilities; z 0 = 2.2 (z 0 = 3.7) yields larger (lower) SNR and lower (larger) errors.We show the detector-frame total mass and the mass ratio in the top row, the source-frame individual masses in the middle one, and the dimensionless spins in the bottom row.At leading order, measurement errors follow the 1/SNR trend.largest (smallest) SNR and the smallest (largest) measurement error.Note that we show the detector-frame total mass in the top row and the source-frame individual masses in the middle one.For all systems the parameters are well constrained, and we find that the error (or relative error for the mass parameters) goes as 1/SNR, as expected in the high SNR regime [77,78,132].This relation is more scattered for the spin parameters, especially for χ 1 : Systems with q 0 = 4 have better spin measurement, in agreement with [75].This is because, for such systems (nonfilled points), higher harmonics become more dominant (see Figs. 2 and 3) and help improve the measurement of the spins.It is remarkable that even for very massive systems (red points), which usually have low SNR in the inspiral, we get tight constraints on their parameters, similarly to what [133] found.This is the benefit of using a fully consistent modelling of the IMR signal, since in our model the merger-ringdown signal also informs us on the parameters of the component BHs in the binary.
For very massive systems with low SNR, we find a multimodality in intrinsic parameters, as illustrated in Fig. 5. Secondary modes arise from combinations of parameters that yield remnant parameters compatible with the true ones within the measurement uncertainty, as shown in the upper-right part in Fig. 5.As a consequence, the merger-ringdown signal re-mains quasi-identical to the synthetic injection.This can be seen in Fig. 6, where we compare the waveform of the injected parameters (in blue in Fig. 5) to the one with parameters from one of the secondary maxima (in red in Fig. 5).The early part of the inspiral signal is fairly different, but this has little importance because the system is merger-ringdown dominated, and the SNR in the inspiral is small compared to the total SNR (∼ 30, see Fig. 3 and Table I).
In Appendix A we discuss the impact of the tolerance of the integrator used to solve the Hamilton equations and compute the EOB waveforms.In Appendix B we compare the measurement errors obtained with pSEOBNRv5HM to the ones obtained with the IMR phenomenological model IMRPhenomTHM [91], finding comparable results.

B. Measurement of QNMs and possible deviations from GR
We now turn our attention to QNM measurements and constraints on deviations from GR using MBHB observations with LISA.FIG. 5. Corner plot for the system with M t,0 = 2 × 10 8 M ⊙ , χ 1,0 = 0.2, χ 2,0 = 0.1, q 0 = 4, z 0 = 2.2, and SNR=167.The blues lines show the injection and the red ones a secondary mode.Contours show the 68%, 90%, and 95% confidence intervals, and dashed lines the 0.05 and 0.95 quantiles.On the upper-right part, we show the inferred properties of the remnant.The multimodality observed in the binary parameters is due to combinations of intrinsic parameters that yield remnant properties compatible with the true values within the measurement uncertainty, as can be seen from the absence of clear multimodality in M f and a f .FIG. 6.Comparison between synthetic-injection waveform and the one at a secondary maximum (same colour code as in Fig. 5).The late inspiral and merger-ringdown signals match almost perfectly.The early inspiral signals are dephased, but the SNR in the inspiral is so low that it makes little difference.

GR injections
First, we consider the case where the injected signal is compatible with GR, and allow for nonzero deviations when running the Bayesian analysis.In Figs. 7 and 8, we show the width of the 90% confidence interval centred around the median on the deviation parameters (i.e., ∆δ f ℓm and ∆δτ ℓm ).We find that deviations to the frequency are generally better constrained than those to the damping time.As a consequence of the higher SNR of the dominant (2, 2), (3,3), (4,4), and (5, 5) harmonics, fractional deviations to their QNMs are better constrained than those to the subdominant (2, 1), (3,2), and (4, 3) harmonics.For the former, δ f ℓm and δτ ℓm are typically constrained within 0.1 and even within 0.01 for the systems with M t,0 = 2 × 10 7 M ⊙ (blue points) and follow the 1/SNR trend, with some scatter for higher harmonics [especially the (3, 3) and (5,5) harmonics] that depends on the mass ratio.Here again, the reason for this is that, for systems with q 0 = 4 (nonfilled points), higher harmonics are more excited (see Figs. 2 and 3), so we are able to better constrain deviations in their QNMs, in agreement with [74].Deviations in subdominant harmonics are poorly constrained for very massive systems (red points), ∆δ f ℓm and ∆δτ ℓm ∼ 1, which given our prior range, translates into uninformative measurements.This is due to the low SNR of these harmonics.However, we get rather good constrains, within 0.1, for systems with M t,0 = 2 × 10 7 M ⊙ (blue points).We also find that for lowspin systems (circles) deviations to the (2, 1) harmonic are better constrained than the ones to the (3, 2) and (4, 3) harmonics, whereas the opposite happens for high-spin systems (squares), in agreement with our remark on their relative predominance in Sec.III.We show the impact of allowing for deviations from GR on the measurement of intrinsic parameters in Appendix C.
We translate these constraints on the fractional deviations into measurements of the QNMs in Fig. 9, where we show some representative examples of 90% confidence regions on the QNMs of all the harmonics included in pSEOBNRv5HM.The upper-left panel shows a best-case scenario, where all QNMs can be perfectly distinguished.It illustrates that applying the Rayleigh criterion [58] to both the damping time and the frequency to decide on the distinguishability of QNMs is too stringent.Indeed, as can be seen from the upper-left panel in Fig. 9, the one-dimensional projection of the 90% confidence regions onto the y axis (damping time) can overlap [e.g., for the (4, 4) and (4, 3) QNMs], although the twodimensional regions are well separated.Thus, one should really consider the two-dimensional regions in order to decide on the distinguishability of QNMs, as pointed out in Ref. [71].However, given the little correlation between τ ℓm and f ℓm , it is often enough to apply the Rayleigh criterion only to the damping time or to the frequency, as suggested in previous works [58,59,134,135].In the cases shown here, the frequency is enough to decide on the distinguishability.The upper-right and lower-left panels illustrate cases where not all QNMs can be resolved.As expected from our comments above, for a high-spin system such as the one shown in the upper-right panel, deviations to the (2, 1) QNM are poorly constrained, so its measurement uncertainty contour encloses that of the (2, 2) mode.Similarly, the lower-left panel shows a low-spin system, for which the (4, 3) QNM measurement uncertainty contour contains the (4, 4) one.Finally, the lowerleft panel shows a worst-case scenario where the QNMs can- (a) M t,0 = 2 × 10 7 M ⊙ , χ 1,0 = χ 2,0 = 0.9, q 0 = 4, z 0 = 2.2, SNR=5174.(b) M t,0 = 2 × 10 8 M ⊙ , χ 1,0 = χ 2,0 = 0.9, q 0 = 4, z 0 = 2.2, SNR=672.not be distinguished due to the large uncertainty on the subdominant harmonics.Note that the system in the lower-left panel has a total mass of 2 × 10 7 M ⊙ , illustrating that the confidence regions of QNMs are not always all well separated for systems with M t,0 = 2 × 10 7 M ⊙ , although they do tend to yield better results than for systems with M t,0 = 2 × 10 8 M ⊙ , as illustrated in Figs.7 and 8.We find some cases of multimodality in the deviation parameters, as illustrated in Fig. 10.They can be understood by looking at the corresponding values of QNMs.Indeed, the frequency of the secondary mode in the (2, 1) QNM matches the frequency of the (3, 2) QNM.Because their damping times are poorly constrained, they are fairly compatible.Thus, this multimodality can be understood as the subdominant harmonics trying to "match" each other.

Non-GR injections
We now consider non-GR injections, and we generate synthetic signals with nonzero deviations to the QNMs.Deviations to GR in the QNM frequencies have been derived in non-GR theories and typically vary in the range 0.01-0.1 or even smaller.In the spherically symmetric case (i.e., nonspinning BHs), they were computed in theories such as Einstein-Maxwell-dilaton [136], dynamical Chern-Simons gravity [42], Einstein-dilaton-Gauss-Bonnet gravity [40,41,137], higher-curvature gravity theories [53], and for some solutions in massive (bi)gravity [138][139][140].Recently, the computation of QNMs for spinning BHs in non-GR theories has received much attention, since the remnant BHs we are observing with LIGO and Virgo have typically spins of about 0.7.They include the Kerr-Newman case in Einstein-Maxwell theory [44][45][46][47], Einstein scalar Gauss-Bonnet gravity [50,51], higher-curvature gravity theories [52,53], and dynamical Chern-Simons theory [54].Estimates for QNMs of spinning BHs in non-GR theories have also used the connection between the light ring and QNMs [41,[141][142][143], which is formally valid only in the eikonal ℓ → ∞ limit, and are known to fail to describe some families of QNMs when additional degrees of freedom are present [41].
Here, we assume the fractional deviation to GR to be 0.01 for all harmonics, for both frequencies and damping times.In Figs.11 and 12, we show the width of the 90% confidence FIG. 10.Reduced corner plot of deviation parameters and QNMs for a GR injection of the system M t,0 = 2 × 10 8 , χ 1,0 = χ 2,0 = 0.9, q 0 = 2, z 0 = 3.7, and SNR=339.We find a multimodality in the δ f 21 posterior, which is the less-well measured harmonic for this system.This leads to a secondary mode in the f 21 posterior at frequencies that match f 32 .Given that the damping times of the two harmonics are poorly measured and their posterior fairly compatible, this suggests that the (2, 1) harmonic is trying to match the (3, 2) harmonic and vice versa.interval centred around the median for δ f ℓm and δτ ℓm .As a rule of thumb, we consider that a deviation can be measured when it is larger than the measurement error.Graphically, this corresponds to the points that are below the black dashed lines in Figs.11 and 12.For this value of the deviation (i.e., 0.01), we find that it could be detected in the frequency of the (2, 2), (3, 3), (4, 4), and (5, 5) harmonics of systems with M t,0 = 2 × 10 7 M ⊙ , and for the higher SNR ones we could also detect this deviation in their damping time.We note that the errors shown in Figs.11 and 12 are very similar to the ones we find when injecting GR signals (see Figs. 7 and 8).We also perform injections with deviations of 0.1 and 0.001 (not shown here), and find again similar errors.Therefore, we can extrapolate the results presented here and read from those figures which values of the deviations would be needed to detect them.For instance, a deviation of 0.1 could be detected for almost all systems presented here, in both the frequency and the damping time of the dominant harmonics (left column), and for the higher SNR systems, even of the subdominant harmonics (right column).Detecting a deviation in several harmonics, preferably in both the frequency and the damping time, would reinforce our confidence that we are truly observing effects in gravity theories alternative to GR.

V. IMPACT OF SYSTEMATICS
In order to assess if the lack of accuracy in waveform modelling could spoil ringdown tests of GR, we generate synthetic injections with NR waveforms and recover them with pSEOBNRv5HM templates.We use the waveform SXS:BBH:2125 from the Simulating eXtreme Spacetimes Collaboration [144] at the highest available resolution.It provides the signal of a BBH with mass ratio 2 and aligned spins of magnitude 0.3.First, we perform injections for the four total mass and redshift combinations detailed in Sec.III, allowing for deviations from GR in the QNMs when running our Bayesian analysis.Next, we investigate more methodically how systematic effects come into play and propose a novel method to assess their impact.

A. Results on astrophysical systems
First, we consider the case where the injected signal contains only the dominant (2,2) harmonic and include in the pSEOBNRv5HM templates the same harmonic content.We compare in Fig. 13 the posteriors obtained for injections with different total masses, giving SNR = 175 and 1312, respectively.The injected values (blue lines) are well within the 90% confidence regions for the heavier system (black), having total mass M t,0 = 2 × 10 8 M ⊙ .However, due to its much FIG.13.Corner plot of intrinsic and deviation parameters for NR injections recovered with pSEOBNRv5HM templates.Contours show the 68%, 90%, and 95% confidence intervals, and dashed lines the 0.05 and 0.95 quantiles.We compare the results for M t,0 = 2 × 10 8 M ⊙ , z 0 = 2.2 (black) and M t,0 = 2 × 10 7 M ⊙ , z 0 = 3.7 (red); both have q 0 = 2 and χ 1,0 = χ 2,0 = 0.3.To ease comparison, we have rescaled the total mass to the injected value.In each case, both injection and templates contain only the (2, 2) harmonic.The heavier system has lower SNR, and the impact of mismodelling is small when including only the (2, 2) harmonic, so that the injected values (in blue) are well within the 90% confidence regions.It is nonetheless relevant for the lighter system, due to its much higher SNR.In the latter case, one would be misled into thinking that the frequency of the (2, 2) QNM departs from the Kerr prediction.The spins are poorly measured for the heavy system due to the absence of higher harmonics.
higher SNR, the parameter estimation of the lighter system (red), having total mass M t,0 = 2 × 10 7 M ⊙ , is strongly biased.In particular, a deviation from GR in the frequency of the (2, 2) QNM is erroneously detected with high confidence.We recall that our model does not explicitly include overtones beyond n = 0, as discussed in Sec.II.Including them in the merger-ringdown signal could, in principle, reduce this systematic bias, but would require fitting for the amplitude, phase and starting time of each overtone, which has proven very difficult so far [145], and could end up having the opposite effect on the accuracy of the waveform.Next, we inject NR signals containing all the harmonics included in pSEOBNRv5HM and use templates with full harmonic content in the Bayesian analysis.As can be seen in Fig. 14, once we include higher harmonics (red), even for the more massive system the posterior shifts further from the true values while getting narrower, making it incompatible with the true parameters at more than 95% confidence, in particular, for the GR deviation ones.We stress that the worsening for the system shown in Fig. 14 is less due to the moderate increase in SNR than to the inclusion FIG.14.The same as Fig. 13, but now comparing results when including only the (2, 2) harmonic (black) versus when including all harmonics of pSEOBNRv5HM (red), for M t,0 = 2 × 10 8 M ⊙ and z 0 = 2.2.The inclusion of higher harmonics worsens the match between NR and pSEOBNRv5HM waveforms, leading to significant biases in all parameters, in particular, the GR deviation ones.Note that, when including all harmonics, the deviation parameters for all QNMs are allowed to vary, and we find similar biases for all of them, but we show only the posterior for deviations in the (2, 2) harmonic because of space limitation. of higher harmonics, as we demonstrate in Sec.V B. This is not surprising since the accuracy of the SEOBNRv5 model degrades when including higher harmonics, as comprehensive comparisons to ∼ 440 NR waveforms and NR surrogate models have shown [81].One of the difficulties lies in the relative alignment between harmonics.When considering harmonics individually, we have freedom in the alignment (in phase and time) of the waveforms.When including several harmonics, we have a single phase shift and a single time shift that can be varied for all harmonics simultaneously.The relative alignments between them are fixed, and might not agree with the ones of NR waveforms.In the next section, we further illustrate how such tests of GR become less reliable when including higher harmonics.

B. Exploring systematic effects
Our goal here is to understand how systematic effects come into play, and, in particular, how do they depend on which portion of the signal (inspiral or merger-ringdown) dominates.In addition to total masses of 2 × 10 7 M ⊙ and 2 × 10 8 M ⊙ , we also consider M t = 2 × 10 6 M ⊙ .The minimum frequency used to analyse the signal remains  II.Full IMR SNR and its decomposition into the contribution of the merger-ringdown and inspiral stages for the NR waveform SXS:BBH:2125 [144], assuming a total SNR of 175.We show results both for when only the (2, 2) mode is considered and when all harmonics modelled in SEOBNRv5HM are included.
10 7 M ⊙ )] Hz.This is a pessimistic choice for signals with M t = 2 × 10 6 M ⊙ , as they can accumulate significant SNR in the early inspiral, but with this choice, the length of the signal that is analysed (in geometric units) is kept fixed for different total masses.Moreover, we fix the SNR of the sources rather than their distance, to allow for a fair comparison between systems.Unless specified, we take the SNR to be 175 [as for the system with M t = 2 × 10 8 M ⊙ at z 0 = 2.2 when including only the (2, 2) harmonic shown in Sec.V].We stress that the long-wavelength approximation for the LISA response is expected to break for M t = 2 × 10 6 M ⊙ .Overall, the signals considered in this section might not be astrophysically realistic (e.g., we expect systems with M t = 2 × 10 7 M ⊙ to have much larger SNR than 175), but we seek to have a methodic understanding of how systematic effects appear for different signal morphologies.We start by considering injections with the (2, 2) harmonic only, and then with all other harmonics included in SEOBNRv5HM.In both cases, the injected signal and templates have the same harmonic content.In Table II, we give the merger-ringdown and inspiral SNR of the systems we consider, assuming an IMR SNR of 175, for both harmonic content options.

(2, 2) only
In Fig. 15, we compare the posterior distributions obtained for different total masses.Note that, unlike in previous plots, we show the chirp mass, defined as M c = (m 3  1 m 3 2 /(m 1 + m 2 )) 1/5 , rather than the total mass.This is the mass combination that is best measured during the inspiral.We find that, as the total mass decreases, the chirp mass and the mass ratio are better measured and in better agreement with the true value.This is because the fraction of the SNR in the inspiral is larger for lighter systems, and this is the portion of the signal where our templates are in better agreement with NR waveforms.In Fig. 16, we compare the adimensionalised TDI strain A for the highest likelihood point in each case.The upper panel focuses on the inspiral, and we can see that the agreement is better for lighter mass systems.The lower panel focuses on the merger-ringdown portion [we recall that t/M t = 0 is the peak of the (2, 2) amplitude], and we can see that the system with M t = 2 × 10 8 M ⊙ is the one in the best agreement with the injection, as expected from the fact that it is the system for which the fraction of SNR in the merger-ringdown is the largest.
Looking at the caption in Fig. 15, it might appear surprising that the system with M t = 2 × 10 6 M ⊙ is the one for FIG. 15.
Corner plot for the parameters estimated using pSEOBNRv5HM to analyse a mock NR injection for different total masses, considering only the (2, 2) harmonic and fixing the SNR of the injection to be 175.Contours show the 68%, 90%, and 95% confidence intervals, and dashed lines the 0.05 and 0.95 quantiles.In the upper-right part, we show the posterior on the fractional deviations to GR QNMs.which the GR deviation parameters are the largest (while still being very much consistent with 0).The reason behind it is that the determination of intrinsic parameters for this system really comes from the inspiral, and the GR deviation parameters are then estimated to maximise the match in the merger-ringdown portion of the signal.In contrast, the intrinsic parameters of the M t = 2 × 10 8 M ⊙ system are chosen to maximise the match in the merger-ringdown.This becomes clearer if we increase the SNR, as in Fig. 17, where we take the IMR SNR to be 707.The posteriors become thinner and the intrinsic parameters in the M t = 2 × 10 8 M ⊙ case are no longer compatible with the true value.Moreover, for M t = 2×10 6 M ⊙ , we definitely favour nonzero deviations from GR.The fact that this is needed to maximise the match, even for an SNR of 175, is likely a consequence of the model used for the merger-ringdown [see Eq. (2.3) and below].When changing the QNMs, the amplitude and phase coefficients in FIG.16.Comparison between the waveforms of the maximum likelihood points (from the analyses shown in Fig. 15) and the injection waveform.We plot the adimensionalised TDI strain A as a function of time in geometric units to allow for the comparison between different total masses.
Eqs. (2.6a) and (2.6b) change consistently, possibly providing a better match, in particular, around the merger, where we use a purely phenomenological description, and discrepancies between our templates and NR waveforms are larger.Moreover, our model might not be accounting for higher overtones accurately enough.Therefore, favouring a nonzero value for the GR deviation parameters does not necessarily reflect that the physical QNMs are different from the Kerr ones, but rather that, given the functional form assumed, such QNM values provide overall a better fit to data.It would be interesting to explore if we still find such deviations from GR when allowing for the additional modifications around the merger proposed in [67].
Finally, in Fig. 18, we show the distribution of loglikelihood values obtained from running parameter estimation allowing or not for deviations from GR (pSEOBNRv5HM versus SEOBNRv5HM) for the three values of the total mass (fixing the SNR to 175).We find that, as the total mass decreases, the likelihood values increase, and allowing for deviations from GR improves more the fit, in agreement with Fig. 15.We can use the results shown in this plot to estimate the SNR from which it will be favoured to allow for deviations from GR.For this purpose, we introduce the Akaike information crite-FIG.17.The same as Fig. 15 for an IMR SNR of 707.rion (AIC) [146], defined as where L is the maximum likelihood and n p is the number of free parameters.The latter accounts for the dimensionality penalty.When choosing between different models to describe observed data, the one with minimum AIC is favoured.Moreover, we can estimate the log-Bayes factor between model 1 and model 2 as In [147], we justify the use of the expression above in the case of Gaussian posteriors and discuss how it relates to the standard indistinguishability criterion [148][149][150][151][152]. Here, n p is 9 when using SEOBNRv5HM and 11 when using pSEOBNRv5HM.
From the definition of the likelihood [Eq.(2.14)], we can write the log-likelihood (up to an additional constant that depends only on the noise properties of the detector) as ( In general, it can be difficult to accurately estimate ln L from the MCMC alone, because it lies in the higher-end tail of the log-likelihood distribution (see Fig. 18).It becomes harder as the dimensionality of the parameter space increases.To get a better estimate of the maximum likelihood, we start from the Gaussian approximation.Under this hypothesis, (5.9) The mean log-likelihood is less dependent on sampling the tails of the distribution than the maximum log-likelihood.Thus, we can use the log-likelihood samples we get from MCMC to estimate it and then use Eq.(5.9) to estimate L. We recall that our prior on θ is flat, so the likelihood values obtained with MCMC are fair draws of the distribution followed by the log-likelihood.In Fig. 18, we overplot in full lines the probability density function of the theoretical distribution of log-likelihood values assuming that ln L ∼ ln L − 1 2 χ 2 (n p ), and L was estimated with the procedure described above.We can see that the agreement between this prediction and the distribution we obtain with MCMC is remarkable, even though the likelihood is not actually Gaussian in the event parameters.Moreover, we have verified that the integrated weight of the theoretical probability density function above the maximum likelihood we find with our MCMC is typically below 1/N s , where N s is the number of MCMC samples.This indicates that our sampling of the log-likelihood function is compatible with its theoretical estimate.Using this method to estimate L, we find that, for SNR=175, we have ln B < 0 for all three total masses.This is in agreement with Fig. 15, where the posteriors are compatible with GR at least at ∼ 68% confidence.We now estimate the SNR for which pSEOBNRv5HM would be definitely favoured with respect to SEOBNRv5HM.Following the Kass-Raftery scale [153], we adopt the criterion ln B > 3 to estimate that a model is favoured with respect to another.For M t = 2 × 10 6 M ⊙ , we estimate SNR ≳ 330, and for M t = 2 × 10 7 M ⊙ and M t = 2 × 10 8 M ⊙ , SNR ≳ 598 and SNR ≳ 977, respectively.These values are in agreement with Fig. 17, where having a zero value for the GR modifications is supported for M t = 2 × 10 8 M ⊙ , but it is not in the other cases.

All harmonics
In Fig. 19, we show the equivalent of Fig. 18 when including all harmonics, fixing the total IMR SNR to 175.We consider one extra case: using for the template the pSEOBNRv5HM model with the (2, 2) QNM fixed to its Kerr value (i.e., δ f 22 = δτ 22 = 0).The motivation for this is the following: The rationale for this test of GR is that the parameters of the binary are measured from the inspiral; from there, we can estimate the final mass and spin and, therefore, the QNM spectrum.We then seek to measure deviations from this spectrum.However, for heavier systems, most of the information comes from the merger-ringdown and the measurements are no longer "independent."Thus, we want to investigate how the parameter estimation changes when not allowing for deviations in the dominant mode.
First, we note that the likelihood values we get when including all harmonics are lower than when including only the (2, 2) mode, despite the significant increase in the number of free parameters in the pSEOBNRv5HM case (23 versus 11), showing that the agreement with the injected NR waveform worsens.Focusing on the SEOBNRv5HM case first (in black), we find that the hierarchy of likelihood values between the different total masses is compatible with the values reported in Table II: The best fit is for the M t = 2 × 10 6 M ⊙ system, because most of its SNR comes from the inspiral, where our templates are more reliable.In the pSEOBNRv5HM case, we find that the quality of the fit improves significantly, with a larger improvement for more massive systems.Surprisingly, we find higher likelihood values for the M t = 2 × 10 7 M ⊙ system than for the M t = 2 × 10 6 M ⊙ one.This happens because, in the pSEOBNRv5HM case, the additional parameters allow a substantial improvement in the match in the merger-ringdown portion of the signal but not in the inspiral.To better understand this, we start from Eq. (5.4), and make the assumption that (d c |d c ) = (s 0,c |s 0,c ) ≃ (s c ( θ)|s c ( θ)), where θ is the maximum likelihood point.In other words, we assume that the loudness of the recovered signal is virtually the same as the one of the injection.If we further assume that the contribution from each channel is roughly the same, we can write ln L = −2SNR 2 (1 − FF(s 0 , h)), (5.10)where FF is the fitting factor between the true signal and our templates defined as the maximised overlap: (5.12) In the previous equations, we can split between the contribution coming from the inspiral (I) and the one coming from the merger-ringdown (MRD): From Table II, we see that the fraction of SNR coming from the merger-ringdown for the M t = 2 × 10 7 M ⊙ system is larger than the one coming from the inspiral for the M t = 2 × 10 6 M ⊙ one.Thus, if introducing the QNM deviations allows a sufficient improvement in the match in the merger-ringdown for the M t = 2 × 10 7 M ⊙ system, we can obtain higher likelihood values for it than for the M t = 2 × 10 6 M ⊙ system, unlike in the SEOBNRv5HM case.As in the case of the (2, 2) harmonic only, we can estimate the SNRs for which pSEOBNRv5HM is favoured with respect to SEOBNRv5HM, now including all harmonics.Note that now n p = 23 for pSEOBNRv5HM.We find SNR ≳ 68, SNR ≳ 93 and SNR ≳ 214 for M t = 2 × 10 8 M ⊙ , M t = 2 × 10 7 M ⊙ and M t = 2 × 10 6 M ⊙ , respectively.We stress that these values are below the typical SNRs we expect for MBHBs, in particular, for sources with M t = 2 × 10 6 M ⊙ and M t = 2 × 10 7 M ⊙ .Systems with M t = 2 × 10 8 M ⊙ could also have SNR above the respective limit, as is the case for the system shown in Fig. 14, for which GR is indeed excluded at more than 95% confidence.Moreover, our estimates are in agreement with the corner plots shown in Figs. 25, 26 and 27 in Appendix D, where we show the posterior for the different total masses in all three scenarios.For an IMR SNR of 175, only the lighter system is compatible with GR in the pSEOBNRv5HM case.We note that, in Ref. [62], the authors performed a similar study for LIGO detectors at design sensitivity and found the GR value of QNMs to be well within the 90% confidence interval even when injecting an NR waveform with an SNR of 75.This result is in qualitative agreement with our estimate of SNR ≳ 214 for pSEOBNRv5HM to be favoured with respect to SEOBNRv5HM for the M t = 2 × 10 6 M ⊙ system, which is the closest one among our systems (in terms of relative contribution of the inspiral and merger-ringdown) to the system considered in Ref. [62].From Figs. 25, 26 and 27, it is possible to see that the posteriors in the pSEOBNRv5HM with fixed (2, 2) QNM case are slices of the pSEOBNRv5HM posteriors on the δ f 22 = δτ 22 = 0 hypersurfaces, as they should be.It is harder to see that the posteriors in the SEOBNRv5HM correspond indeed to slices of the pSEOBNRv5HM posteriors on the δ f ℓm = δτ ℓm = 0 hypersurfaces, because this slice of the posterior yields much lower likelihood values than explored by the sampler in the pSEOBNRv5HM case, as can be seen from Fig. 19, except in the M t = 2 × 10 6 M ⊙ case (Fig. 27).We find that the chirp mass and the mass ratio tend to be in "better" agreement with their true values in the pSEOBNRv5HM fixed (2, 2) case than in the full pSEOBNRv5HM one, but we have no conclusive evidence of this.The spins are usually wrongly measured, with only a secondary mode in the SEOBNRv5HM case for M t = 2 × 10 6 M ⊙ containing the true values.This suggests that the inclusion of spins in our waveforms is one of the bottlenecks for performing accurate parameter estimation.
It might seem surprising that the likelihood values in the pSEOBNRv5HM with fixed (2, 2) case are only a little lower than in the full case, as illustrated in Fig. 19, even more given that (2, 2) is the dominant mode.A likely explanation for this is as follows.Deviations in the QNMs can partially compensate for the misalignment in harmonics between NR and pSEOBNRv5HM waveforms, as they change the amplitude and phase of the different harmonics, and improve the match, reinforcing our claim in the discussion around Fig. 15: Measuring nonzero GR deviations does not necessarily mean that the physical QNMs are different from their predicted Kerr values.However, the phase shift and the time shift that we fit when doing parameter estimation are defined with respect to the peak of the (2, 2) amplitude.Therefore, it is less crucial to allow for deviations in this harmonic.It would be interesting to investigate how the match worsens when fixing other QNMs, in particular, the next-to-dominant (3,3), or when fixing several but not all of them.We leave this for future inves-tigation.
These studies show that, because of the high SNRs that we will reach with LISA, the accuracy requirement for waveforms is much more stringent than for current ground-based detectors.Moreover, our results suggest that particular attention should be paid to the modelling of higher harmonics, as their inclusion increases biases, and of spins, as their measurement tends to be very biased.Let us stress that even the NR waveforms we currently use are not accurate enough for SNRs of thousands [81,152] and would need to be improved by at least one order of magnitude.

VI. CONCLUSION
Gravitational-wave observations have provided us with brand-new opportunities to test GR.In particular, ringdown tests are one of the most promising possibilities to detect deviations from GR.In this work, we have assessed how a fully consistent modelling of the IMR signal will allow us to perform high-precision ringdown tests with MBHB observations by LISA.To do so, we have performed synthetic injections of astrophysically realistic systems and analysed them with templates in GR (SEOBNRv5HM) and with parameterised deviations from GR (pSEOBNRv5HM), using the newly released SEOBNRv5HM waveform family [79][80][81][82]154].More specifically, the pSEOBNRv5HM templates allow for deviations to the QNMs (frequency and damping time) of all the harmonics included in the model.All our analyses have been done in a fully Bayesian framework.
First, we have considered the case where we use the GR SEOBNRv5HM templates for both the synthetic injection and the Bayesian analysis.We find that having a consistent modelling of the whole signal allows us to measure the parameters of the binary accurately, even for signals with very little SNR in the inspiral (e.g., very massive MBHBs with total mass ∼ 10 8 M ⊙ ).Source-frame masses can typically be measured within 10% and even within 1% for systems with total mass ∼ 10 7 M ⊙ .Spins can be measured within 0.1 and down to 0.001 for the primary BH.Second, we have shown that deviations to the QNMs of the dominant harmonics [i.e., the (2, 2), (3,3), (4,4) and (5,5) harmonics], can be constrained within 10% and down to 1% for systems with total mass of the order 10 7 M ⊙ .Those are also the magnitude of deviations that we could measure in a non-GR signal.Converting the measurement of fractional deviations into measurements of QNM frequencies, we find that for most systems we could accurately measure and distinguish the QNMs of several harmonics, up to all seven included in the pSEOBNRv5HM model in the most favourable cases (i.e., total mass 10 7 M ⊙ , mass ratio ∼ 2 − 4, highly spinning, and at z ∼ 2).
Then, we have assessed the impact of systematics on ringdown tests by using NR waveforms to perform synthetic injections and analysing them with pSEOBNRv5HM.In order to estimate at which SNRs we would expect to erroneously measure deviations from GR, we have developed a novel approach based on the Akaike information criterion We have found that, in particular, when higher harmonics are included, parameter estimation is significantly biased already for SNRs of O(100), leading to the erroneous detection of deviations from GR in high SNR signals.The results we have obtained when using pSEOBNRv5HM for the injection and the Bayesian analysis give us a sense of the incredibly high precision to which we will be able to perform ringdown tests with LISA.However, in order not to jeopardise those tests, the accuracy of our waveform models needs to be improved far beyond current standards, which is one of the major challenges facing the GW community over the next few years.
In this work, we have focused on MBHB systems that are in the high-mass end of predictions for LISA observations, typically produced in astrophysical models where MBHs form from the evolution of heavy seeds [90].This is because those are the ones for which we expect the highest SNR in the merger-ringdown [74,75].However, it would be interesting to assess how observations of lighter systems, which might be more numerous, could be used to detect deviations from GR in the ringdown.Also, we have neglected the effect of spin precession and eccentricity, which might not be appropriate, in particular, if MBHBs harden through triplet interactions [86].Finally, the waveform model we used in this work allows for deviations from GR only in the ringdown, whereas, if deviations are present, we should expect them to affect the whole signal.Different theory-agnostic formalisms have been proposed to account for deviations in the inspiral [155][156][157], typically by modifying the post-Newtonian expansion of the GW phase [158], and progress has recently been made to account for deviations in the plunge-merger stage [67].It would be interesting to assess how to link modifications in different parts of the signal or at least to assess how the constraints change when accounting for all possible modifications.We leave these studies for future work.
After the first submission of this work we identified an error in our parameter estimation pipeline: we were missing a factor of 1  2 inside the exponential entering the likelihood [Eq.2.14].The SNR values cited in the original version of the paper were not affected by this error.Because we consider zero-noise injections, this can be interpreted as, when performing parameter estimation, either (i) placing the sources a factor √ 2 closer, or (ii) dividing the LISA PSD by 2, i.e. considering a level of noise √ 2 lower.In both cases the SNR of the systems for which we performed parameter estimation need to be multiplied by √ 2. We have decided to adopt the first possibility, and changed the redshifts of the fiducial sources we consider from 3 and 5 to 2.2 and 3.7, and applied the corresponding √ 2 factor to all SNR values quoted in the paper and shown in the figures.We stress that, given the uncertainty on the exact LISA noise, and that we use the SciRdv1 noise curve, which is considered to be a pessimistic estimate of the LISA noise, the second possibility would also have been a reasonable choice.This updated version matches the one published in Phys.Rev. D.  20 induces secondary maxima over the parameter space.Those lead to very non-Gaussian distributions and induce an apparent bias when projecting onto one-dimensional posteriors.However, as indicated by the black and red lines, the maximum-posterior point is close to the synthetic injection, as expected when using flat priors.
using the IMRPhenomTHM and the SEOBNRv5HM waveforms.Black lines correspond to y = x.We see that the estimates are in good agreement, with a slight discrepancy for the error on spins, in particular, in the case of high-spin systems (circles).In this regime, our current waveforms are less accurate, so the agreement between them is worse (see also comparisons between these two waveform models in Ref. [81]).Errors are similar, with a slightly higher discrepancy for the spins, in particular, for high-spin systems (circles).FIG.24.Comparison of the measurement error on intrinsic parameters when setting deviation parameters to 0 (x axes) and when letting them vary (y axes) for a GR injection.Black lines show y = x.Measurements worsen when allowing for deviations from GR to vary but remain comparable to the GR case.

FIG. 1 .
FIG.1.Time evolution near the merger, which occurs at t = 0, of the SEOBNRv5HM (2, 2) harmonic amplitude (top panel) and instantaneous frequency (bottom panel) when varying the spin components of a binary with mass ratio 2, assuming equal and aligned spins.

FIG. 3 .
FIG. 3. Total SNR (black) and SNR in the individual harmonics (colours) for all the systems at z 0 = 2.2.The values at z 0 = 3.7 can be obtained by rescaling by 18, 331/33, 691 ≃ 0.54.Circles show the full IMR SNR and squares the SNR in the inspiral stage.

FIG. 7 .
FIG.7.Width of the 90% confidence interval on δ f ℓm centred around the median for GR injections.

FIG. 9 .
FIG. 9. 90% confidence interval on the QNMs for four different systems.Crosses indicate the true values.Between columns only the total mass varies, and between rows mass ratio, spins, and redshift vary, keeping the same total mass.In every case, the synthetic injection is GR.The top-left panel illustrates why applying the Rayleigh criterion to both the damping time and the frequency to decide on the resolvability of QNMs is too stringent: Although the one-dimensional damping time posteriors overlap, the two-dimensional ones are well separated.That case corresponds to a "best-case scenario."The other panels show cases where not all seven QNMs can be distinguished.The (2, 1),(3,2), and (4, 3) modes are typically less-well constrained due to their lower SNR (see also Figs.7 and 8).

FIG. 11 .
FIG. 11.Width of the 90% confidence interval centred around the median for δ f ℓm for injections with δ f ℓm = δτ ℓm = 0.01, as indicated by the black-dashed lines.

. 4 ) 2 . ( 5 . 5 ) 2 −
For a zero-noise injection, d c = s 0,c , and (d c |d c ) = SNR 2 c is the SNR of the signal in the TDI channel c.If we vary the loudness of the injection by changing the distance (as we do here) and rescale the template by the same factor, which corresponds to rescaling the distance, then (d c |d c ) is the only term that depends on the SNR.Thus, we get the simple scaling for the log-likelihood: ln L = ln L 175 SNR 175 Note that this scaling does not rely on approximating the likelihood as a Gaussian on the event parameters.Combining Eqs.5.1, 5.2, and 5.5, we find that the log-Bayes factor for pSEOBNRv5HM versus SEOBNRv5HM is ln B = ln LpSEOBNRv5HM,175 − ln LSEOBNRv5HM,175 SNR 175 (n p,pSEOBNRv5HM − n p,SEOBNRv5HM ).(5.6) ) where θ ′ are the coordinates of θ in the basis of eigenvectors of the covariance matrix C and σ ′2 i are the eigenvalues of C. From Eq. (5.8), we can see that 2(ln L − ln L) follows a χ 2 distribution with n p degrees of freedom.Since the mean of a χ 2 distribution with n p degrees of freedom is n p , denoting the mean by ⟨•⟩, we have ln L = ⟨ln L⟩ + n p 2 .

FIG. 18 .
FIG. 18. Distribution of log-likelihood values when analysing withSEOBNRv5HM and pSEOBNRv5HM for different total masses.

FIG. 21 .
FIG. 21.Comparison between parameter estimation results when using low-tolerance waveforms (black) and standard ones (red) for the same system as in Fig. 20.Blue lines indicate the injection point; black and red symbols indicate the maximum-posterior point of the low-tolerance and standard-tolerance runs, respectively.The nonsmoothness of the likelihood function illustrated in Fig.20induces secondary maxima over the parameter space.Those lead to very non-Gaussian distributions and induce an apparent bias when projecting onto one-dimensional posteriors.However, as indicated by the black and red lines, the maximum-posterior point is close to the synthetic injection, as expected when using flat priors.

FIG. 23 .
FIG.23.Comparison between the measurement errors with SEOBNRv5HM (x axes) and with IMRPhenomTHM (y axes).Black lines show y = x.Errors are similar, with a slightly higher discrepancy for the spins, in particular, for high-spin systems (circles).

FIG. 25 .
FIG.25.Corner plot for the parameters estimated in three different scenarios for a total mass of 2 × 10 8 M ⊙ including all harmonics of SEOBNRv5HM and fixing the SNR of the injection to be 175.Contours show the 68%, 90%, and 95% confidence intervals, and dashed lines the 0.05 and 0.95 quantiles. 81]

TABLE I .
Full IMR SNR and its decomposition into the contribution of the merger-ringdown and inspiral stages for the systems considered in this work at z 0 = 2.2.Values at z 0 = 3.7 can be obtained by rescaling by 18, 331/33, 691 ≃ 0.54.We recall that the inspiral and mergerringdown SNRs add quadratically.All the systems we consider are merger-ringdown dominated, although some of them have high SNR in the inspiral as well, up to thousands.