Ready for what lies ahead? -- Gravitational waveform accuracy requirements for future ground based detectors

Future third generation (3G) ground-based GW detectors, such as the Einstein Telescope and Cosmic Explorer, will have unprecedented sensitivities enabling studies of the entire population of stellar mass binary black hole coalescences in the Universe. To infer binary parameters from a GW signal we require accurate models of the gravitational waveform as a function of black hole masses, spins, etc. Such waveform models are built from numerical relativity (NR) simulations and/or semi-analytical expressions in the inspiral. We investigate the limits of the current waveform models and study at what detector sensitivity these models will yield unbiased parameter inference for loud ''golden'' binary black hole systems, what biases we can expect beyond these limits, and what implications such biases will have for GW astrophysics. For 3G detectors we find that the mismatch error for semi-analytical models needs to be reduced by at least \emph{three orders of magnitude} and for NR waveforms by \emph{one order of magnitude}. In addition, we show that for a population of one hundred high mass precessing binary black holes, measurement errors sum up to a sizable population bias, about 10 -- 30 times larger than the sum of 90\% credible intervals for key astrophysical parameters. Furthermore we demonstrate that the residual signal between the GW data recorded by a detector and the best fit template waveform obtained by parameter inference analyses can have significant SNR ratio. This coherent power left in the residual could lead to the observation of erroneous deviations from general relativity. To address these issues and be ready to reap the scientific benefits of 3G GW detectors in the 2030s, waveform models that are significantly more physically complete and accurate need to be developed in the next decade along with major advances in efficiency and accuracy of NR codes.

Future third generation (3G) ground-based gravitational wave (GW) detectors, such as the Einstein Telescope and Cosmic Explorer, will have unprecedented sensitivities enabling studies of the entire population of stellar mass binary black hole coalescences in the Universe, while the A+ and Voyager upgrades to current detectors will significantly improve over advanced LIGO & Virgo design sensitivities. To infer binary parameters from a GW signal we require accurate models of the gravitational waveform as a function of black hole masses, spins, etc. Such waveform models are built from numerical relativity (NR) simulations and/or semi-analytical expressions in the inspiral. We investigate the limits of the current waveform models and study at what detector sensitivity these models will yield unbiased parameter inference for loud "golden" binary black hole systems, what biases we can expect beyond these limits, and what implications such biases will have for GW astrophysics. For 3G detectors we find that the mismatch error for semi-analytical models needs to be reduced by at least three orders of magnitude and for NR waveforms by one order of magnitude. In addition, we show that for a population of one hundred high mass precessing binary black holes, measurement errors sum up to a sizable population bias, about 10 -30 times larger than the sum of 90% credible intervals for key astrophysical parameters. Furthermore we demonstrate that the residual signal between the GW data recorded by a detector and the best fit template waveform obtained by parameter inference analyses can have significant signal-to-noise ratio and can lead to Bayes factors as high as 10 11 between a coherent and an incoherent wavelet model for the population events. This coherent power left in the residual could lead to the observation of erroneous deviations from general relativity. To address these issues and be ready to reap the scientific benefits of 3G GW detectors in the 2030s, waveform models that are significantly more physically complete and accurate need to be developed in the next decade along with major advances in efficiency and accuracy of NR codes.

I. INTRODUCTION
Observations of gravitational waves (GWs) from coalescing compact object binaries have revolutionized our knowledge about the Universe and provided access to astrophysics previously outside our grasp [1]. These observations were made possible by the construction and operation of a network of GW detectors, Advanced LIGO [2], Advanced Virgo [3] and KAGRA [4]. As expected from a relatively young field of observational astrophysics, there are still significant technological improvements within reach [5][6][7] increasing the sensitivity of both current generation GW detectors over the next ∼ 5 years [8] in addition to paving the way for next-generation ground based facilities [9][10][11][12] as well as space-based observatories [13,14], all planned to be operational in the early 2030s. This will allow for direct observation of all stellarmass binary black hole (BBH) mergers throughout the cosmological history of the Universe [15] and will enable unprecedented and unique science in extreme gravity and fundamental physics [16]. Whereas the vast majority of observed GW signals will have originated at large cosmological distances [17,18], binaries from the currently observable volume of the Universe will, as the detector sensitivities improve (see Fig. 1), be observable with increasing fidelity. GW models are crucial for elucidating the astrophysical properties of compact binaries. For this increase in the information available for a typical BBH observation, these models need to satisfy stricter accuracy requirements. The currently available model waveforms, which approximates the solutions to the 2-body problem in general relativity (GR), have been shown to be sufficiently accurate to not cause any systematic biases in the recovered parameters (masses, spins, location in the Universe etc.) for BBHs observed so far [19]. These "golden binaries", stellar-mass BBHs like GW150914 (the first direct GW detection [20]) observed at high signal-to-noise ratio (SNR) and with high fidelity, have also allowed to perform previously inaccessible tests of GR [21] and the robustness of the waveform models used. As the GW detectors improve, the expected SNR for BBHs from the local Universe will increase correspondingly, thus reducing the statistical uncertainties in the recovered source parameters. As the statistical uncertainties approach the inherent systematic uncertainties of the GW approximant models, parameter biases will eventually appear and reduce the reliability of future GW observations. In this study we want to investigate the appearance and significance of these parameter biases, their connection to the accuracy of the GW models used and what relevance the biases will have on future astrophysical statements based on high SNR observations of BBH systems [22]. In addition to possible biases in the source parameters of the BBH, the use of GW observations as means to test GR puts even more stringent requirements on GW model accuracy [21,[23][24][25]. If there are effects of beyond-GR theories embossed on the "raw" GR waveform, these effects will be scrambled by any residual signal left by an inaccurate GW model and thus will limit the strength of the GR test. Even worse, inaccurate GW models may lead to erroneous results claiming deviations from GR. Many of these analyses, both parameter estimation (PE) studies and tests of GR, are strongly dependent on robust observation of the two polarisation states of a GW signal as described by GR [26]. This is primarily done by requiring a coherent observation of a given GW signal in more than one detector [27][28][29], which is also crucial for accurate and reliable localisation of the GW source in the Universe. Whereas the BBHs investigated here are not expected to produce any observable counterpart [30][31][32], precise localisation is crucial for cosmology studies [33][34][35][36][37] as well as for inferring the parameters of the BBHs in their source frame [38].
In addition to biases caused by inaccurate GW models, the data generated by the detectors themselves carry inherent uncertainties originating from the calibration process applied to the raw detector output [39] as well as imprecise modelling assumptions for the noise processes of a given detector system [40]. These types of uncertainties can however already be quantified, and thus incorporated into the PE infrastructure allowing their effects to be marginalised out from the final inferred parameter distributions [1,41,42]. The marginalisation over calibration uncertainties, and similarly the marginalisation over eventual uncertainties in the noise estimation, is primarily expected to broaden the recovered posterior distributions thus effectively absorbing any misestimate in the GW amplitude or phase irrespective of whether it originated from uncertainties in the data itself or from the assumed waveform model. The rest of the paper presents the details of our study. In Sec. II we describe the GW models used, together with details on the analysis methods. In Sec. III we report our findings on the analysis of individual "golden binary" BBH signals, including requirements on the accuracy of the GW models and the consequences any inaccuracies will entail. In Sec. IV we explore a population of BBH observations, and what effects GW model accuracy will have on the properties of the inferred population. Finally in Sec. V we discuss our findings and present an outlook for how to tackle the issues we have presented. Evolution of sensitivity of GW interferometric detectors. Text data files for the PSDs or ASDs can be found in [43,44] under the names given in this table.

II. METHODOLOGY
In this section we introduce the methods we use to study the impact of waveform inaccuracies on measurements of compact binary parameters from GWs in Secs. III and IV. We first discuss common data analysis tools for GW waveforms in Sec. II A and numerical relativity (NR) waveforms, the most accurate waveforms we have available for the solutions of the 2-body problem in GR, in Sec. II B. In addition to NR waveforms, we also use post-Newtonian -numerical relativity (PN-NR) hybrid waveforms as described in Sec. II C to more fully fill the band of more sensitive detectors to lower frequencies as mock signals in this study. In Sec. II D, we discuss fast, but approximate semi-analytic models of the GWs emitted from compact binaries. These models are crucial to infer binary properties with Bayesian parameter estimation methods for single events and populations of binaries, as discussed in Sec. II E.

A. Gravitational waveforms and overlaps
Gravitational waveforms are often decomposed in a basis of spherical harmonics −2 Y lm . The two GW polarizations can be expanded into modes as We define the overlap, or match, between two waveforms h 1 and h 2 as where represents a noise-weighted inner product with PSD S n , and * denotes complex conjugation. We also maximize the overlap over a time and phase-shift between the two waveforms. We often quote the mismatch, 1 − O(h 1 , h 2 ) instead of the overlap.

B. Numerical relativity waveforms
We use two NR simulations of binary black hole coalescences by the SXS Collaboration [45] using the Spectral Einstein Code (SpEC) [46] code. The first simulation, SXS BBH 0308 [45,47], was performed at parameters inferred from the LIGO PE analysis of GW150914 with semi-analytic waveform models [20,48] and was subsequently used to study possible effects of waveform systematics on the inferred parameters [19,49]. The waveform describes a nearly equal mass binary with small effective aligned spin and moderate precession (see Table I). The waveform accumulates 12.6 orbits and a length of 2822M in time before the formation of a common horizon. The mismatch between simulations at different resolutions at the total mass of GW150914 with aLIGO design sensitivity is ∼ 2 × 10 −4 . We use the highest resolution available, Lev5.
The second simulation, SXS BBH 0104 [50,51], is at mass-ratio 1:3 and has some effective aligned and precession spin. Systems at this mass-ratio still lie within the population posterior for the mass-ratio that has been found in the LIGO & Virgo O1 & O2 analysis [1]. The waveform accumulates 21.9 orbits and a length of 5192M before the formation of a common horizon. There is only a single resolution, Lev5, available for this simulation. An estimate for the mismatch for a simulation using similar technology (SXS BBH 0053 [51,52]) gives ∼ 10 −3 at the total mass of GW150914 with aLIGO design sensitivity.
These waveforms are for quasi-circular inspirals and mergers of BBHs. Since initial conditions are not exactly known, there is a low amount of residual eccentricity in these simulations. For SXS BBH 0308 eccentricity at the relaxed time is estimated to be ∼ 0.0005 while for SXS BBH 0104 it is ∼ 0.001. We do not consider the effect of eccentricity in the waveform in this study.

C. Hybrid waveforms
We use an extension of the GWFrames [53,54] package to hybridize NR with post-Newtonian (PN) waveforms. First we read in an NR waveform and its horizon data (i.e. the spins and orbital track data computed from the apparent horizon finder). We generate a PN waveform at the physical parameters of the NR configuration and align it by shifting in time and attitude to match the NR waveform. The waveform modes and the quaternions describing the motion of the inertial frame are then blended over a hybridization region in time. More details about the procedure and the hybrid waveforms are given in App. A.
Estimates of the accuracy of hybrid PN-NR waveforms are difficult to obtain. Hybrid errors are expected to be significantly higher than for pure NR waveforms due to errors in the PN part of the waveform and additional errors from smoothly combining the PN and NR waveform modes over a blending window in time [55][56][57][58]. We show in App. A that hybridization errors are lower than NR error estimates for the simulations considered in this study. Semi-analytical waveform models usually have good accuracy in the inspiral and are less accurate near merger. Therefore, the PN-NR hybrids used as mock signals in this study should be much more accurate than the semianalytic waveform models described in Sec. II D which we use as template waveforms.

D. Waveform models
In this study we use two fast frequency domain waveform models as template waveforms. These are the IMRPhenomPv2 [59,60] and SEOBNRv4 ROM [61] inspiralmerger-ringdown (IMR) models.
IMRPhenomPv2 uses the aligned-spin IMRPhenomD [62,63] model as a base waveform in the co-precessing frame and twists up its (2, ±2) modes with a PN prescription of the Euler angles that describe the motion of the inertial frame for precessing black hole binaries, thus generating all = 2 modes [64,65]. The model also assumes that the opening angle of the precession cone is small [60] which make it most suitable for binaries with small to moderate precession and moderate mass-ratios. The model has been shown to be smooth [66] up to mass-ratio q ∼ 1/4.
SEOBNRv4 ROM is a frequency domain reduced order model of the time domain SEOBNRv4 effective-one-body model [61] using the methodology developed in [67,68]. The model describes the (2, ±2) modes for non-precessing binaries and can be used for a wide range in mass-ratio and BH spin magnitudes up to maximal spin.
Both IMRPhenomD which underlies IMRPhenomPv2 and SEOBNRv4 have been tuned to NR waveforms in the nonprecessing sector. While more complete models in terms of precession are available [69][70][71] we were not able to use them for this computationally demanding study because we could not obtain converged posterior distributions in time. Models that also include higher harmonics [72] or are computationally more efficient [73] are now becoming available.
In the population study described in Sec. IV we use NRSur7dq2 to represent the population of astrophysical signals [74]. These signals were stochastically drawn and thus we could not use NR simulations which are only available at specific points in parameter space. The NRSur7dq2 NR-surrogate model is however a very good approximation to NR waveforms. It describes generic precessing systems with mass-ratios up to q = 1/2 and spin magnitudes of 0.8. NRSur7dq2 is built from multi-ple surrogates that model waveform mode combinations in the co-orbital frame, the averaged frequency of the (2, ±2) modes in the co-precessing frame, and the frame motion through the right hand sides of the precession equations. We intended to also use NRSur7dq2 as a template waveform for the study discussed in Sec. III, but, while being very accurate, this model has a limited length and this severely limits the mass space that can be explored to high mass systems and high starting frequencies.

E. Bayesian parameter estimation
The inference of the source parameters θ of a GW signal is expressed as a posterior probability density function (PDF) p( θ|d(t)) as part of a PE analysis given the data d(t) recorded from the detectors. Through application of Bayes' theorem, p( θ|d(t)) is directly proportional to the likelihood L(d(t)| θ) of observing the data given an assumed waveform model h(t; θ), in turn characterized by the source parameters θ, together with the prior probability π( θ).
For the analysis of the "golden binaries" in Sec. III this prior is defined to be uniform over the two-dimensional space defining the masses of the binary objects, m 1 and m 2 (with m 1 ≥ m 2 ), as observed in the rest frames of the GW detectors. The dimensionless spins of the black holes (BHs) are assumed to follow a prior uniform in spin magnitude (between 0 and 1) allowing for isotropic and uncorrelated directions of the two black hole spins. We also assume an isotropic prior for the location of the GW on the sky, and a distance prior corresponding to a homogenous rate density in the nearby Universe. For these analyses, we disregard any cosmological corrections to the rate density which for the redshifts explored (z ∼ 0.1) are expected to be negligible. The orientation of the binary follows a prior probability uniform in the polarisation angle ψ and in the cosine of θ JN , the angle between the total angular momentum J and the line of sight N. The parameter space defined by π( θ) is, for the golden binaries analysed in Sec. III, explored stochastically using a Markov Chain Monte Carlo code implemented as part of the LALInference package [41,75] available as part of the LSC Algorithm Library (LAL) [76].
For the analysis of the BBH population in Sec. IV the Bilby inference package was used [77,78] exploring the parameter space using the Nested Sampling algorithm dynesty [79]. Here, similar parameterizations and prior assumptions as for the analysis in Sec. III were made. The analyses however differ in their assumptions over BH masses, here using a prior uniform in the binary chirp mass M = (m 1 m 2 ) 3/5 /(m 1 + m 2 ) 1/5 and the asymmetric mass ratio q = m 2 /m 1 as well as assuming a prior on distance that is uniform in co-moving volume. The different prior choices between Sec. III and Sec. III are not expected to have significant impact on the recovered pa-rameters, or on the conclusions about waveform accuracy requirements based on this inference.
In a multi-detector PE analysis we project the signal and template waveforms on the interferometric GW detectors and compute the strain from the waveform polarizations (+ and ×) and their corresponding detector antenna pattern functions [80] h(t; θ) = h + (t; θ)F + (ra, dec, ψ) + h × (t; θ)F × (ra, dec, ψ).
(4) As the focus of this study is on effects of accuracy of the waveform themselves, the signal waveforms representing the true GW signals are added to a time-series containing no noise, as the standard assumption of Gaussian noise could introduce random biases in the recovered parameters. The true GW strain h(t; θ) as emitted by the GW source may however differ from h M (t; θ), the strain measured by the detectors, due to uncertainties in the calibration of the detectors and their recording of the GW strain [39,81]. We can model the relation between the measured and true strain as forh(f ; θ) andh M (f ; θ) where the tilde denotes the Fourier transforms of the time-domain strain h(t; θ) and h M (t; θ) respectively. The uncertainty in the strain amplitude and phase, caused by uncertainties in the detector calibration, are characterized by the terms δA(f ; θ cal ) and δφ(f ; θ cal ) that are nominally expected to vary both across the bandwidth used for the observation as well as over time from observation to observation. The frequency-dependent correction factors are modelled as cubic splines with nodes spaced uniformly in log f , each with an independent δA and δφ parameter [82] which are then numerically marginalised over. For this study, we assume zero-mean Gaussian priors on δA with a standard deviation of 1% (5% for the O1 analysis) and for δφ a standard deviation of 1 • (5 • for the O1 analysis). The magnitude of these amplitude and phase uncertainties are consistent with the performance of the LIGO detectors during O1 [42,83] [84], and the predicted calibration uncertainties for future detector configurations [85,86].

Hierarchical inference
For the population study detailed in Sec. IV, the Bilby inference package [77,78] was used for both the analysis of individual BBHs as well as the subsequent inference on their population parameters. Following the analysis of each individual GW signal assumed to be part of the observed population, their joint population properties, here a single parameter α, can be inferred as a hyper-posterior [87] where L tot ( d|α) is the hyper-likelihood, π(α) is the hyper-prior, d is a collection of data for N independent events drawn from the injection distribution. We write the injection prior as π(θ|α) and our goal is to estimate the hyper-posterior which in turn relies on a hyper-likelihood that can be written as where π(θ k i |ø) denotes the default prior that is used to perform single event parameter estimation and Z ø (d i ) is the evidence obtained for event i. The integral is then approximated in a Monte-Carlo sense, using the single event posterior samples that have been obtained previously.

III. RESULTS FOR GOLDEN BINARIES
In this section we give predictions about parameter biases that would arise if we used current BBH semianalytic waveform models to infer the properties of high mass BBHs in a sequence of past, current and future ground based detector networks.
We select two exceptionally loud "golden binaries": one binary with parameters mimicking GW150914 and one binary at mass-ratio 1:3. Both systems contain BHs with spins misaligned with the orbital angular momentum vector causing the systems to be moderately precessing. We hold the luminosity distance of the systems constant, so that more sensitive detector networks will observe them with higher SNRs and obtain more precise measurements. Parameters for these systems are given in Table I. As signal waveforms we use NR simulations from the SXS [45] catalog computed with the SpEC code [46], as described in Sec. II B. Since these waveforms are too short to fill the frequency band of future interferometers which extends well below 20 Hz, we hybridize the NR waveforms with PN approximants in the inspiral, including higher order modes up to = 8. We use the effective precession spin IMR waveform model IMRPhenomPv2 for our main results and also quote complementary results for the non-precessing SEOBNRv4 ROM model. IMRPhenomPv2 includes = 2, m = ±2 modes in the co-precessing frame, and a PN description of the motion of the co-precessing frame with an approximation for small precession angles [59,60,62,63].

A. Indistinguishability
We want to find an estimate that predicts beyond which SNR a particular waveform model that is used as a template in PE yields biased posterior distributions for the above BBH signals. We can find the answer by calculating the posterior distribution using Bayesian inference. However, this method is fairly costly for the very sensitive future detectors (see Fig. 1) where the signals have SNRs up to several thousands. Therefore, we compare against and extend a simpler metric for predicting the presence of biases.
If two waveforms h 1 and h 2 fulfill the criterion [88][89][90][91] for a given PSD and SNR ρ then they are deemed indistinguishable, i.e, δh|δh < 1 and the posterior PDF should be unbiased in the sense that systematic errors from waveform inaccuracies are smaller than 1 − σ statistical errors. While this criterion is simple to evaluate, there are several problems that affect its usefulness in practise: The criterion is only sufficient, but not necessary and as a result it tends to be too conservative. Namely, if it is violated, biases can, but need not arise. In addition, the pre-factor D is not known precisely. It can be derived as the number of (intrinsic) parameters whose measurability is affected by model inaccuracy [91]. The criterion also applies only in the high SNR limit as is the case for the Fisher information matrix [92].
To enhance the usefulness of the indistinguishability criterion we use the following procedure to tune the prefactor D.
1. We compute posterior distributions for a sequence of detector networks on the above synthetic signals.
2. From the posterior distributions we compute statistical and systematic errors for key parameters (chirp-mass, mass-ratio, effective aligned spin, and effective precession spin).
3. We estimate the network (balance) SNR ρ b at which the computed systematic and statistical errors become comparable.
4. We compute the mismatch 1 − O(h model , h true )(θ true ) between the template waveform and the signal at signal parameters for a representative detector sensitivity.

Finally, we calculate
We present results of applying this procedure to the selected golden binaries in Sec. III B. First we discuss some assumptions we make in applying it.
When computing the balance SNR and the mismatch we have to assume a power spectral density (PSD). We find empirically that systematic and statistical errors become comparable at network SNRs of ∼ 60 for the above sources. This SNR is found at aLIGO design sensitivity for SXS BBH 0308 and at about A+ sensitivity for SXS BBH 0104. The mismatch is only sensitive to the shape of the PSD and the frequency range of the overlap integral. We pick aLIGO design sensitivity [44] as a reference PSD since this is close to the sensitivity where the balance SNR is found, and it is in its vicinity that the tuned indistinguishability criterion should be most accurate. In general we expect that mismatches will degrade as we approach future detectors since they will be sensitive to lower frequencies and will have significantly more waveform cycles in band. The network SNR determines the discerning power of a network of detectors since we analyze the signal coherently. We neglect that the interferometers that make up detector networks usually have different sensitivities and pick a representative PSD. We use this PSD to compute the single interferometer mismatch in the indistinguishability criterion.
We use mock signals as a proxy for the true waveform obtained from exactly solving the two body problem in General Relativity. Hence we also assume that GR is the correct theory of gravity. Ideally our mock signals would be pure NR waveforms. This is in general not feasible since the cost of computing BBH coalescences with NR simulations scales very steeply with the initial frequency, so that in practise only part of the detector band can be filled by the NR signal for high mass BBHs. Therefore, we hybridize NR signals with PN inspiral waveforms.
NR simulations are only approximations of true GR waveforms. NR accuracy depends on the choice of configuration (e.g. more unequal mass-ratios and higher spin systems are harder to simulate accurately as the size of the apparent horizon of the BHs decreases) and on the size of the grid used to discretize Einstein's equations. In reality, NR simulations use multiple domains and a particular discretization method (finite differences [93][94][95], multi-domain spectral collocation methods [46] or more advanced methods, such as discontinuous Galerkin [96,97]). While we can obtain a good estimate of the NR waveform error by computing mismatches for the same physical configuration but different grid sizes to decrease the truncation error and wave extraction errors, it is difficult to estimate the error in a hybrid waveform. We discuss this further in App. A.
In the above procedure for estimating the pre-factor D we need to find the SNR at which the systematic and statistical errors are comparable. We know that parameters are in general correlated and thus we should take these correlations into account when estimating these errors. The indistinguishability criterion also makes this assumption. When quoting parameter estimation results we rely on errors computed from one and twodimensional marginal posterior distributions, which are straightforward to compute and present. Therefore, we also compute the statistical and systematic errors from 1D marginal posteriors. A more conservative measure of the error is to compute where the injection lies in the posterior distribution, or a marginal PDF thereof. We obtain the percentile of the credible level of the injected parameters in the full posterior by performing parame-ter estimation with all sampling parameters fixed, except for the time and phase of coalescence. Detailed measurements of the latter are of no astrophysical interest, and as they can very strongly affect the likelihood, we prefer to marginalize over them.
We also consider a third method where we take into account the correlations in a set of key parameters only. To do this, we compute a kernel density estimate (KDE) of the marginal posterior distribution in the parameters of interest, compute the posterior probability value at the injection parameters and find its credible level in the marginal posterior. We compute a Gaussian KDE K(θ) = KDE[p(θ|d)] of the marginal posterior distribution p(θ|d) and then solve numerically the equation Q(K(θ (i) ); p) = K(θ s ) to find at which percentile 100p the true parameters θ s of the signal lie in the marginal posterior. Here Q is the quantile function Q(PDF; p) = CDF −1 (p) for a given PDF and its cumulative distribution function (CDF). In practise we work with the logarithm of the PDF to reduce the dynamic range. We discuss results from these procedures in the next section.

B. Predicted Waveform Accuracy Requirements
We now apply the procedure presented in Sec. III A to posterior probability distributions and mismatches obtained for the two mock BBH signals shown in Table I for a series of detector networks. The networks are defined by the positions of the detectors on the Earth and their PSDs as listed in Table II. Fig. 2 shows the main results. According to Eq. 8 the general takeaway is that as long as the mismatch for a given semi-analytical waveform model against the mock signal (red lines) lies below the tuned indistinguishability curve (light or dark blue lines) we do not expect parameter recovery to be biased. One can think of the indistinguishability curve showing the "acceptable error" for a waveform model for a particular SNR. Without tuning, the predicted SNR above which we would see biases (assuming that 6 intrinsic model parameters are affected) is about 25 for the SXS BBH 0308 NR signal (and SNR 11 for the hybrid). For SXS BBH 0104 it is predicted to be an SNR of ∼ 6. As we will see in Sec. III C these predictions are certainly way too conservative for the hybrid signals when compared with the parameter estimation results and the assumption that 6 parameters are biased is not correct either.
A first observation is that semi-analytic models (here the representative IMRPhenomPv2 and SEOBNRv4 ROM models) were sufficiently accurate to analyze GW150914 during aLIGO's first observing run. This is hardly a surprise and has been studied in depth by comparing against NR simulations and waveform models by the LVC [19,49]. Fig. 2 also predicts that semi-analytical models will lead to biased parameter recovery at and beyond HLVK sensitivity for SXS BBH 0308 and at and beyond the A+ network for SXS BBH 0104. Moreover,   Table I). Each panel shows mismatch against network SNR and on the top x-axis the detector network (see Table II [45] of the SpEC NR simulations, the total mass and chirp mass in the source frame, the mass-ratio q = m2/m1 ≤ 1, the dimensionless spin vectors χi = Si/m 2 i of the BHs, the effective aligned spin and effective precession spin and the inclination angle between the total angular momentum J and the line of sight N . Signals are hybridized with SpinTaylorT1. Spin vectors are defined at a reference frequency of 30 Hz. We select the remaining common parameters to be ra = 1.949725, dec = −1.261573 (radians), a luminosity distance of dL = 562.59Mpc (which corresponds to a redshift of about z = 0.115), a polarization angle ψ = 1.4289. The GPS time at the geocenter was 1126259642.413 s and coalescence phase φcoa = 0. current NR waveforms will not be guaranteed to be sufficiently accurate for unbiased parameter recovery beyond the Voyager network (where the dark blue line intersects the dark green line). Clearly then current waveform models will not be accurate enough for 3G ground based detectors such as ET and Cosmic Explorer (CE) which are currently being planned. We will require waveform models to be at least three orders of magnitude more accurate, and improvements of one order of magnitude for NR waveforms. Fig. 2 presents a simplified picture to convey the main message that current waveform models are not accurate enough for planned 3G detectors. We now come back to some of the assumptions we have mentioned in Sec. III A and shed some light on details. The shape of the PSDs and the range in frequency over which particular interferometers are sensitive varies with the networks and in-fluences the value of the mismatch that enters the indistinguishability criterion. The horizontal lines shown in Fig. 2 provide a simplified representative measure of the error. For SXS BBH 0308 mismatches against IMRPhenomPv2 range from 0.002 (aLIGO O1) to 0.02 (CE) for the pure NR signal which is in band from 20Hz and above, and from 0.002 (aLIGO O1) to 0.008 (CE) for the hybrid signal. Starting frequencies are given in Table II We also indicate the frequency f low at which we start integrating the likelihood integral and the network SNR of the PN-NR hybrid signals in these networks (see Table I  . Text data files for the PSDs or ASDs can be found in [43,44] under the names given in this table. better 0.04 (CE) to 0.07 (AdvVirgo).
We want to stress that the mismatches depend very sensitively on the inclination angle under which the signal is seen. If we were to change the inclination for SXS BBH 0308 from near face-off, 2.7454, which is compatible with GW150914, to π/3 which emphasizes more harmonics content beyond the dominant (2, ±2) mode, then the mismatch is about an order of magnitude worse. If biases were to appear at the same SNR for this changed inclination this would make D an order of magnitude larger and the left panel of Fig. 2 would look markedly different and have stronger implications for how much waveform models need to be improved.
We have indicated in Fig. 2 the estimated accuracy of waveform models and NR simulations for the particular binary configurations by colored regions that are independent of the detector networks. These regions are supposed to give a rough sense of how accurate currently available models or codes are in the neighborhood of the BBH configurations considered here. Similar considerations as for the mismatches quoted above apply for the bounds of these regions. For simplicity we have bounded these very rough estimates by constant mismatch. Finally, detector calibration error depends on the detector network and is expected to improve over time up to a level that is believed to be attainable from current understanding. In Fig. 2 the estimate of the mismatch error due to detector calibration errors uses a realistic estimate for future detectors and assumes 1% relative error in amplitude, 1 • error in phase [85,86] and the additional assumption that the functional form of the dephasing from detector calibration errors can be modeled by a quadratic function which decreases towards high frequencies. Ultimately, the noise floor that comes from detector calibration error will only become problematic for 3G detector networks if we are not otherwise dominated by waveform errors.

Balancing accuracy using the full posterior
The above results used 1D marginal posterior distributions to calculate statistical and systematic errors and find the SNR at which they are comparable. We now discuss results where we take into account correlations between binary parameters and how they compare to the above. Irrespective of how many parameters we choose to include in the marginal posterior distribution we can always ask the question at which credible level the injection lies in the posterior distribution. We want to estimate when this is close to the 68th percentile. Since we only have data for fixed networks we need to interpolate the percentile values to estimate the SNR at which errors are balanced.
For the SXS BBH 0308 PN-NR hybrid signal we find the injection in the full posterior at the 2nd percentile for the O1 network and at the 100th percentile for HLVK, marginalizing over relative time and phase. For the marginal posterior in (M, q, χ eff , χ p ) we find the injection at the 12th and 100th percentile in O1 and HLVK, respectively. For the 1D marginal distributions in these parameters we find that the injection lies between the 4th to 50th percentile for O1 and between the 78th and 99th percentile in HLVK. Therefore, for this configuration we find a similar balance SNR of 60 for these different ways of computing the error balance. This estimate is somewhat uncertain, since we do not have any datapoints in between the O1 and HLVK networks. In terms of the prefactor D, we would expect to have D ∼ 8 if the key parameters are biased, but we find D ∼ 20 if the errors are balanced at SNR 60. We note that the chirp mass and the effective precession spin are quite biased for this signal. For the NR only signal we find D ∼ 30 because the mismatch is worse in the late inspiral and merger part.
For the SXS BBH 0104 source we find the injection in the full posterior at the 7th and 100th percentiles for the O1 and HLVK networks, respectively. For the marginal posterior in (M, q, χ eff , χ p ) we find the injection below the 40th percentile for O1, HLVK, A+, and Voyager, and it lies at the 100th percentile for the ET and ET-CE networks. For the 1D marginal distributions in these parameters we find that the injection lies between the 3rd to 43rd percentile for O1 and between the 12th and 40th percentile in HLVK. The balance SNR is then estimated to be ∼ 250. In combination with the large mismatch between the signal and template waveforms, it results in an enormous prefactor of D ∼ 10 4 . The reason is that there is almost no bias in (M, q, χ eff , χ p ), as can be seen in Fig. 4 discussed in Sec. III C. If we add inclination and distance parameters then there is a noticeable bias and we find the injection at the 50th percentile for O1 and at the 100th percentile for the HLVK network and beyond. This results in a more reasonable balance SNR of ∼ 22 and a prefactor of D ∼ 90. Using the 1D marginal errors we find a balance SNR of roughly 50 and a prefactor of D ∼ 450. The naive indistinguishability criterion with D = 8 predicts biased recovery at SNR 10, which is close to the SNR of the signal in the O1 network.

C. Parameter estimation results
We now turn to looking directly at posterior distributions for the analysis of the two mock BBH signals from Table I for a series of detector networks. Histograms and 90% credible regions for key parameters are shown in Figs. 3 and 4 for the SXS BBH 0308 and SXS BBH 0104 sources, respectively. Here we show IMRPhenomPv2 posteriors since this model includes approximate precession effects, in contrast to the non-precessing SEOBNRv4 ROM model.

Results
The posterior distributions of the detector-frame chirp mass shown in the top left panel of Figs. 3 and 4 become progressively tighter as we go to more sensitive networks, their widths scaling roughly inversely with the SNR. This is the expected behavior for a multi-modal Gaussian which the posterior distribution is expected to follow in the high SNR limit, although the 90% credible regions for some marginal 2D posteriors shown in the other panels are clearly not Gaussian. Only part of the chirp mass posterior is shown for O1 sensitivity so that we can more clearly see the posteriors for networks operating at higher sensitivities. The measurement precision in chirp mass in terms of the width of the 90% credible interval increases from ∼ 5M (O1), to 0.4M (HLVK), and 0.004M (ET+CE). The massive increase in precision for 3G detectors is expected due to the improved sensitivity and the significantly larger number of waveform cycles in the detector frequency band. For instance, for SXS BBH 0308 there are 64 cycles in band from 10 Hz, compared to 217 cycles from 5 Hz and 1025 cycles from 2Hz. For SXS BBH 0308 the O1 posterior is unbiased, with the true chirp mass value (red dashed line) near its peak. For the HLVK network and beyond the posteriors peak away from the true chirp mass. The true chirp mass is found at the 98th percentile for HLVK and for the Voyager network and beyond at the 100th percentile. For ET and ET-CE the chirp mass is underestimated by 0.18M . For SXS BBH 0104, there is again no visible bias at O1 sensitivity. For HLVK, the true value lies at one sigma away from the peak, and at the 92nd percentile for the Voyager network. Recovery is very accurate for the ET and ET-CE networks with a bias of −0.01M .
In the remaining panels of Figs. 3 and 4 we show 90% credible regions for marginal 2D posteriors for several key parameters, to give a sense of the correlations between binary parameters, starting with chirp mass and massratio. Compared to chirp mass, the mass-ratio is much more difficult to measure, resulting in very wide posteriors. This is especially true for the near equal mass SXS BBH 0308 source. There, the one-sided 10% percentile of the mass-ratio PDFs is roughly at 0.7 for 2G detectors. For 3G detectors the measurement is much more precise, again due to more inspiral cycles being observable, but in this case the mass-ratio is estimated to be too close to equal-mass with a bias q true −q MAP ≈ −0.15. For the unequal mass SXS BBH 0104 source, the massratio is much better measured. The measurement precision in terms of the 90% interval increases from 0.5 (O1) to 0.1 (HLVK) and 0.01 (ET-CE). Biases only appear for 3G detectors, where they are about −0.05.
During LIGO and Virgo's O1 & O2 observing runs precession effects have so far eluded measurement from compact binaries [1]. In terms of the effective precession spin parameter χ p [59,65] the posterior distributions shown in GWTC-1 have not provided new information compared to the prior distribution. We expect this situation to change with the improved sensitivity of future detectors [27,28] and the analysis of these sources is a case in point that we will be able to measure precession effects with future detectors. For SXS BBH 0308 the 90% interval for χ p shrinks from 0.7 (O1) to 0.5 (HLVK), 0.2 (Voyager) and 0.04 (ET-CE), while for SXS BBH 0104 it shrinks from 0.6 (O1) to 0.4 (HLVK), 0.2 (Voyager) and 0.004 (ET-CE). Beyond O1 sensitivity where the measurement is uninformative, SXS BBH 0308 posteriors are severely biased, overestimating χ p by about 0.6. The system is thus seen as close to maximally precessing while the averaged in-plane spin is only ∼ 0.3. For  Table I) using IMRPhenomPv2 as the template waveform for a sequence of detector networks (see Table II SXS BBH 0104 the χ p measurements are much more reliable and only offset by ∼ 0.1. Finally we show results for the marginal posteriors in luminosity distance d L and the inclination angle θ JN between the total angular momentum J of the binary and the line of sight vector N under which the binary is seen from the detector network. These two parameters are especially degenerate in how they affect the amplitude of the source and the 2D posteriors are in general not Gaussian which limits the usefulness of 1-dimensional interval estimates and biases. The inclination posterior can have a single mode as for SXS BBH 0308 which is seen close to the face-off inclination of the source, with some overestimation in θ JN and underestimation in the distance, or it can be bi-modal as for SXS BBH 0104 for networks with (close to) co-located detectors (O1, ET) which have a harder time constraining it to the correct mode. Networks with better coverage of the Earth (HLVK, A+, Voyager, ET-CE) obtain the correct mode, but the inclination angle is substantially underestimated along with overestimating the distance by a factor of about 2. The only network to recover the inclination and distance with good accuracy is the ET-CE network.

Discussion
In this section we provide a comparison between results obtained from two different waveform models, the agreement between these models and the source PN-NR waveforms and discuss the importance of limitations of the models in interpreting the parameter estimation results.
In this study we perform parameter estimation on signals in zero noise. This is a particular noise realization that can be interpreted as the average over all possible Gaussian noise realizations. It is an appropriate choice when one wants to focus on the effect of waveform systematics on posterior distributions. Therefore, any discrepancy we see between the posterior estimates and the true source parameters of the mock signals must be due to disagreements between the source and template waveforms or due to prior effects. Given that we use high accuracy NR or PN-NR waveforms as the signal, which    Table I) using IMRPhenomPv2 as the template waveform for a sequence of detector networks (see Table II are good approximations of GR waveforms, and we analyse high SNR events these disagreements are assumed to come from approximations to GR waveforms made in the waveform models we use as templates.
We performed the parameter estimation analyses with the IMRPhenomPv2 and SEOBNRv4 ROM IMR models. The assumptions made in these models are described in Sec. II. In Sec. III C 1 we presented results from the effective precessing IMRPhenomPv2 model. Here we juxtapose these results against the posterior distributions obtained for the aligned-spin SEOBNRv4 ROM model. In Table III we show medians and 90% credible intervals for selected source parameters and the two BBH sources for the O1, HLVK, and ET-CE networks. To gauge measurement accuracy we show absolute biases divided by the standard deviation in Table IV. We find that the two models give overall similar results for the parameter estimates. Noticeable differences are as follows: The chirp mass for the HLVK network is recovered more accurately for SEOBNRv4 ROM for SXS BBH 0308 compared to IMRPhenomPv2. Similarly, SEOBNRv4 ROM recovers the mass-ratio, effective aligned spin, luminosity distance and inclination angle with better accuracy than IMRPhenomPv2 for SXS BBH 0308 in the HLVK network. For SXS BBH 0104 in the HLVK network, SEOBNRv4 ROM does not recover the chirp mass very accurately, but finds the other selected source parameters with better accuracy than IMRPhenomPv2. For the O1 network all parameters except distance and inclination are unbiased. At HLVK sensitivity several parameters exceed unity in the modulus of the normalized bias, which indicates that the difference between true and MAP parameter value is larger than one standard deviation. The largest biases are found in the luminosity distance and inclination for SXS BBH 0104 recovered by the IMRPhenomPv2 model and for the effective precession spin χ p for SXS BBH 0308 found by IMRPhenomPv2.
Turning towards the ET-CE network we see in the size of the 90% intervals that measurement precision has increased dramatically, for instance the chirp mass is measured to ±0.002M , two orders of magnitude more accurately than for HLVK. The precision for the mass-ratio has increased by about one order of magnitude to roughly ±0.005 and similarly the effective aligned spin is measured to ±0.002 and the effective precession spin better than ±0.02. In the ET-CE network all parameters shown here have normalized biases exceeding unity in their absolute value. All of these parameters are estimated to lie outside one standard deviation for the two waveform models employed here, making it clear that waveform models need to be improved for analyses with 3G detectors.
The PN-NR signal waveforms we used to represent the GWs emitted by the source binaries contain higher harmonics beyond = 2, but the waveform models used as templates only include the dominant quadrupolar modes. In fact, the models do also not include all of the m modes at = 2, but merely the = 2, m = ±2 contribu-tions in the co-precessing frame. This begs the question how much the missing higher modes affect the analyses. In terms of SNR ρ for SXS BBH 0308, 99.5% of ρ 2 is found in the (2, ±2) mode (ignoring precession), while for SXS BBH 0104 95.6% of the total ρ 2 is found in the (2, ±2) mode, and 3.8% in the (3, ±3) mode. The above percentages are stated in terms of ρ 2 as SNR adds in quadrature. The overlap between a signal with and without higher harmonics at the signal parameters is 0.9997 for SXS BBH 0308 and 0.96 for SXS BBH 0104, which illustrates that higher modes only become important for higher mass-ratios. To compute these numbers we used the SEOBNRv4 ROM [61,67,68] and a SEOB-NRv4HM ROM [105] waveform models and aLIGO design sensitivity with a starting frequency of 10 Hz. Computing the detector response (4) and optimizing over the polarization angle for the template waveform while keeping sky location fixed yields overlaps of 0.9993 and 0.96, instead. For SXS BBH 0308 (SXS BBH 0104), the overlap between an NR waveform that includes all = 2 modes vs a waveform that only includes the (2, ±2) modes in the co-precessing frame is 0.99996 (0.99992), or 0.9992 when optimizing over the polarization angle. This shows that the for these configurations the (2, ±1) modes in the co-precessing frame are very weak.
As we have seen in Sec. III B, overlaps between the PN-NR signals and the two waveform models at the signal parameters for aLIGO design are significantly lower than the overlaps which include or leave out higher modes. They are 0.97 (0.91) for the SXS BBH 0308 hybrid signal and 0.91 (0.94) for SXS BBH 0104 using IMRPhenomPv2 (SEOBNRv4 ROM). This indicates that the disagreement between the signal and template waveforms comes predominantly from modeling error in the co-precessing frame ( , m) = (2, ±2) mode. Some disagreement could also come from the approximate description of precession.
For all analyses presented in this study, we have assumed that the zero noise data still carries an inherent uncertainty in its calibration, as described in Sec. II E. This uncertainty is modelled as a cubic spline, enforcing a smooth variation across the bandwidth of the analysis. In principle, by allowing this additional degree of freedom which could absorb some of the mismatch between the PN-NR hybrids and the approximate waveform models used in the PE analysis the observed biases could be expected to be reduced. Comparing the 1D posterior distributions shown in Fig. 3 to distributions from analyses where the marginalisation over calibration uncertainties has been disabled, the observed biases remain. It should be noted that the analyses which includes marginalisation over calibration uncertainties systematically recovers a slightly higher SNR accumulated over the detector network, but as this increase is of order 1/1000 this is not expected to affect the conclusions with any significance.

IV. POPULATION STUDY
We have seen in Sec. III that in the HLVK design network we already expect biases with current waveform models for loud BBHs such as GW150914. Even small biases found for weaker single events could still manifest themselves when estimating properties of the population of BBHs [106][107][108][109]. In this section we perform a PE analysis for a population consisting of one hundred high mass precessing BBH events. On the one hand we study the distribution and correlation of parameter biases and compute the overall bias over the population. On the other hand we analyze the residual between the signal and the best matching template waveform, in terms of its SNR, power in the time frequency plane, and in terms of Bayes factors between analyses assuming coherent and incoherent signals across the detector network as implemented with BayesWave [110,111]. Finally, we compute the population posterior for the power law index of the primary mass of the source binaries.

A. Setup
The events in this population study were drawn from the following distribution of source parameters: The primary mass has a PDF p(m 1 ) ∝ m −α 1 with α = 1.3 and m 1 ∈ [45, 50]M and the mass-ratio is distributed as q ∼ U (0.5, 1). The chirp mass M = M tot η 3/5 is computed from (m 1 , q), where M tot = m 1 + m 2 and η = q/(1 + q) 2 . The remaining parameters are distributed as follows: spin magnitudes a i ∼ U (0, 0.8), spin tilts cos t i ∼ U (−1, 1), the azimuthal angle between the spin vectors φ 12 ∼ U (0, 2π), the angle between the total and orbital angular momentum φ JL ∼ U (0, 2π), the inclination angle cos θ JN ∼ U (−1, 1), the polarization angle ψ ∼ U (0, π). The luminosity distance, geocenter time, sky location, and phase were fixed at the parameters given in Table I.
Since NR waveforms are only available at isolated points in parameter space and thus cannot well represent the above distribution we choose the NRSur7dq2 NR surrogate model for the signal waveforms [74]. This choice implies restrictions to mass-ratio q ≥ 1/2, spin magnitudes a i ≤ 0.8 and, due to the relatively short waveform length, the constraint on the primary mass m 1 ≥ 45M , so that waveforms representing the BBH population start at or below 20 Hz. We perform PE analyses with the Bilby code [77,78] with signals in zero noise and IMRPhenomPv2 templates for the HLVK network.

B. Bias
We can learn about how population parameters will be affected by studying correlations between biases in key source parameters for events drawn from a population and to what degree single event biases average out over the population. In Fig. 5 we show absolute biases, defined as the difference between the true source parameters θ i s and a point estimate of the posterior distribution θ i p for event i As a default we use the MAP value of the posterior distributions as the point estimate. We see that biases are large when the signal is represented by NRSur7dq2 waveforms and the template by the IMRPhenomPv2 model. In contrast, when the signal and template are represented by the same IMRPhenomPv2 waveforms there is only a small discrepancy between the MAP and the true signal parameters which is expected to arise from stochastic sampling and prior effects. Here the posterior distribution is dominated by the likelihood since the signal SNRs are high. We find that log-likelihood values come close, but are a bit lower than, the peak value of the log-likelihood at the signal parameters. While the MAP (or equivalently maxL) parameters are a bit different than the signal parameters, the deviations in the waveform are tiny and the SNR in the residual is on the order of one. The spread is a factor 7 smaller in chirp mass, a factor 4 smaller in effective aligned spin and a factor 2 smaller in massratio. For both types of signals we observe pronounced correlations between these parameters which we expect on physical grounds due to how these parameters enter the inspiral waveform [19,28,80,[99][100][101][102][103][104]. We find Pearson correlation coefficients of R M,χ eff ∼ 0.8(0.8) and R M,q ∼ 0.5(0.3) for NRSur7dq2 (IMRPhenomPv2) signals. For NR-surrogate signals the chirp mass shows a clear tendency to be overestimated. This is also true for effective spin and mass-ratio. In contrast, for IMRPhenomPv2 signals the distribution of the single event biases is more symmetrical. We also see that for NR-surrogate signals heavy binaries are prone to overestimation of χ eff as indicated by the luminosity of the red pentagons.
In Fig. 7 we see that bias in the luminosity distance tends to be negative, and with the definition in Eq. 10 this implies that the distance is overestimated in inference as a rule. The distance bias is reduced by about half for IMRPhenomPv2 signals, compared to NRSur7dq2 signals, but it is still sizable. In contrast we find relatively small biases of about 10 • in the inclination angle. To get a better sense of how much these biases matter we discuss the distribution of the ratio R of absolute biases and 90% credible intervals shown in Fig. 6, where we divide the bias by the extent of the 90% credible interval for each event and parameter. For NR-surrogate signals |R| reaches unity for the chirpmass, takes values up to 2 for the effective aligned spin, and about 1.5 for the mass-ratio which indicates that the parameter recovery is strongly biased. The choice of comparing to the 90% interval is more conservative than to 1 − σ which is assumed in the indistinguishability criterion. In contrast, for IMRPhenomPv2 signals |R| is smaller than 0.4 for all parameters and the majority of events are found with very good accuracy |R| 0.1. We show the overall bias over the population in Table V. For NR-surrogate signals the largest population bias is seen for the MAP. Using the mean or median as a point estimate the overall bias is significantly lower than when using the MAP. This is not the case for IMRPhenomPv2 signals, where the largest bias is found for the mean. We also show the sum of ratios of the biases over the 90% intervals, i R i . The size of this quantity shows more clearly how severe the biases are overall averaged over the population. Again the sum of the biases is much larger for the NR-surrogate signals, about 10 -30 times larger than the sum of 90% credible intervals for key astrophysical parameters. The magnitude of i R i for IMRPhenomPv2 signals is about ∼ 2 indicating that there is no significant bias when combining all events in the population. We will revisit the question of how population estimates are affected in Sec. IV D.

C. Residuals
We previously discussed biases found for events in the BBH population study. The biases stem from a disagreement between the signal NR-surrogate waveforms h s (t; θ s ) and IMRPhenomPv2 template waveforms h m (t; θ s ) used in the analysis at the source parameters θ s . This disagreement will also lead to some residual power being left over after subtracting the data containing the signal from the best fit template waveform, h m (t; θ MAP ). Here we discuss how this residual power can be characterized in terms of SNR and power in the time frequency plane. We also perform an analysis with BayesWave.
In Fig. 8 we show the network SNRs found in the signal strain h s (t; θ s ) and in the residual strain h s (t; θ s ) − h m (t; θ MAP ). In each detector of the network we compute the strain by projecting the waveform polarizations on the detector as defined in Eq. (4). We observe that residuals reach SNRs of about 12, expect for one event with residual SNR ∼ 18.37. Parameters for this event are shown in Table VI. We find residual SNRs up to 30% of the signal SNR. The log-likelihood at MAP is highest for events where the agreement between the signal and template waveforms is good and thus the residual is small, and it drops substantially for events where the residual contains a sizable fraction of the signal SNR. For the event with the highest residual SNR the biases are only moderate ∆M = −0.27M , ∆q = 0.11, ∆χ eff = −0.02, and ∆χ p = −0.04, but it has a high signal SNR 87.91.
Next we take a look at the power in the time frequency plane and compare the loudest residual against a chirp signal. In Fig. 9 we plot the Q-transform [112] using PyCBC [113] of the residual with the highest SNR. As shown in the left panel, the power in the residual in LIGO Hanford traces out a chirp signal and agrees well with the overlaid time frequency evolution of the waveform emitted by the source. The right panel shows that the coherent power in the residual (taking into account timeshifts for each detector) is about a factor 5 larger than the power of the loudest single detector residual. Most of the coherent residual power is concentrated near merger where the GW signal is most non-linear.
Finally, we analyse the residual strain across the detector network using the BayesWave [110] code, assuming no pre-defined signal model apart from constraining signals coherent across the detector network to an elliptical polarization. Here, the waveform is reconstructed directly, through a superposition of Morlet-Gabor, or sine-Gaussian, wavelets [110,114], where the number, placement and properties of the wavelets are themselves variables in the analysis. For this study, we compare two competing models for the observed residual data [111]. The coherent model assumes a common waveform across the entire network, as originating from a point in the sky and projected onto each detector assuming standard antenna pattern functions for the two tensorial polarization modes as defined in Sec. II E [115]. The incoherent model assumes complete independence between the observed signals across the network. Instead of the data being represented through a common set of wavelets projected onto the detectors this model constructs a separate waveform for each detector where the placement and structure of the wavelets is independent from other detectors and no phase and time coherence across the network is required. The two models [116] can then be directly compared through a Bayes factor for each set of residual strains as shown in Fig. 10. As BayesWave is constructed, it has a strong dependance on signal complexity, as opposed to simply depending on signal strength only, in order to make observational claims such as for example preferring a coherent description of the signal over an incoherent one [117]. This means that the Bayes' factors inherently incorporate the Occam factor between the two models, where the incoherent model can require a larger number of wavelets (and hence a larger number of signal parameters) to reconstruct the data across the network as it does not need to consider extrinsic parameters (sky position and two angles describing the polarization and ellipticity of the gravitational wave). For the set of residual strains in this study, we often find the incoherent model incapable of capturing the signal in an individual detector, with a median number of 0 wavelets per detector. The coherent signal on the other hand always captures the common signal, but even here the median number of wavelets is "only" 1. We interpret this as BayesWave being consistently able to determine that there is something originating from a common coherent source in the data, but due to the relatively low SNR we are not generally able to make strong inference on the physical description of what this coherent signal would be. Even so, we argue that this type of analy-sis will be a valuable tool in determining the power and accuracy of future modelled inference [21,118], and can ensure that all of the observable signal can be captured and characterized. Note that the analysis here is performed in a noise-free set of data, assuming a known and fixed set of detector sensitivities shown in Fig. 1. For "real" data, the presence of time-varying random Gaussian noise [119], as well as actual detector glitches [120], is expected to reduce the fidelity of this category of tests, however BayesWave is already capable of accounting for such variance [40,117]. The level to which variations in data will affect a study of residual recovery will be left for future investigation.

D. Population inference
We follow the hierarchical Bayesian inference method described in Sec. II E 1. We show the hyper-posterior for α, the power-law index of the primary BH mass in Fig. 11, where we assumed a hyper-prior π(α) ∼ U (1, 2). Unfortunately, the PDFs of the power-law distribution for the true value and the boundary values of α are rather similar over the narrow mass interval considered here. This is probably due to the rather tight lower mass bound which is set by the finite length of the NRSur7dq2 waveform model. Given that there is not much information in the hyper-posterior we ask the question whether we prefer α = 1 or α = 2. Clearly, α = 1 is preferred by the hyper-posterior. This agrees with the observation that in the single event posterior PDFs we overestimate the masses (see Fig. 5).

A. Summary of Results
In this study we have looked at the impact of inaccuracies in models of the GW waveform on inferring parameters for single loud events and for populations of binary black holes (BBHs). In Sec. III we presented results from parameter estimation analyses with current waveform models (IMRPhenomPv2, SEOBNRv4 ROM) for two simulated PN-NR signals at fixed luminosity distance for a series of detector networks. These "golden binaries" are therefore observed with increasingly high SNR as we look towards future detectors which are about a hundred times more sensitive than the current ones. From the posterior distributions we calculated systematic and statistical errors and produced a tuned version of the indistinguishability criterion (see Eqns. (8) and (9)). In Fig. 2 we show the resulting "acceptable error" as a function of SNR. The main result of this paper shows that current waveform models used as templates in our PE analyses need to be improved for aLIGO design sensitivity and beyond: For 3G detectors such as Cosmic Explorer and the Einstein Telescope, the mismatch error for semi-analytical models needs to be reduced by three orders of magnitude and by one order of magnitude for NR waveforms.
In Sec.III C 2 we saw that waveform inaccuracies can come from a combination of factors: errors in the dominant (2, ±2) modes in the co-precessing frame, approximate modeling of the precessing reference frame of the binary, and from missing higher harmonics in the waveform. Better semi-analytical models that include more physics are becoming available [72,73,105,[121][122][123][124].
It stands to reason that if inferred binary parameters for single events are affected by inaccuracies in waveform models that these deficiencies will also impact the analysis of populations of compact binaries. In populations, many events will be significantly weaker than the loud "golden binaries" we have considered before. Still, many small errors may sum up to give a sizable effect that can impact analyses. Therefore, in Sec. IV we presented a study for one hundred high mass BBHs mock signals (either NRSur7dq2 or IMRPhenomPv2) drawn from an astrophysically motivated distribution in the intrinsic parameters. We again performed PE with the semi-analytical IMRPhenomPv2 model for the aLIGO-Virgo-KAGRA design sensitivity network.
In Fig. 6 we find that parameter biases between key parameters such as chirp mass, mass-ratio, and effective spin are strongly correlated, the population sum of these biases is nonzero for the NRSur7dq2 signals, and the largest parameter biases lie outside 90% credible intervals. Posteriors for IMRPhenomPv2 signals still show the correlations, but as shown in Table V the population sum of their biases is close to zero.
The residual between the GW data recorded by a detector and the best fit template waveform obtained from PE can be analysed further. If the waveform template cannot capture all the features in the signal then the residuals (for the detectors in a network) will contain some coherent power (i.e. the residuals are not just due to random noise fluctuations in each detector). We show that this is the case for a NRSur7dq2 event in our population study which has significant SNR in its residual (see Fig. 8), and significant power in the time frequency plane (see Fig. 9). We have also carried out a BayesWave analysis and show in Fig. 10 the Bayes factors between a coherent and an incoherent wavelet model for the population events. Most events have a residual SNR of about 12 and Bayes factors of about 60 in favor of the coherent model while the event with the loudest residual has a residual SNR of 18 and BF of 4 × 10 11 .
We have computed the population hyper-posterior of the power-law index of the larger BH, the only free parameter in the distribution of source parameters. Due to the shortness of the signals in time, the hyper-posterior is not very informative, but it shows preference for the lower bound of the prior α = 1 over then upper bound α = 2, and is thus closer to the true value α = 1.3.  (10) and (11)) for chirp mass, mass-ratio, effective aligned spin and effective precession spin. The point estimate θp is either the mean, median, or MAP. We show biases for NRSur7dq2 and IMRPhenomPv2 signals.   Fig. 9.

B. Outlook
Let us discuss several further implications of systematic errors in measured binary parameters caused by inaccuracies in waveform models. They concern the astrophysical relevance of biases. the future of waveform modeling and NR simulations, and how tests of GR will be affected.
In this study we have reported extensively about biases in inferred binary parameters. How much should we care about these biases? Beyond the simple statement that parameter biases will matter more when they are large, we would like to point out particular situations when biases are especially important and can severely impact the interpretation of GW observations. Severe biases could cause a misidentification of the class of a compact binary, e.g. confusing BNS, NSBH, BBH sources near the lower mass gap [125,126]. Large biases in spin parameters such as the effective precession spin χ p could lead to a misidentification of formation channel of a binary. This could also happen if the effective aligned spin parameter χ eff was heavily biased, but in general χ eff measurements are a lot more robust since this parameter is connected with the length of the inspiral signal [104]. We have seen in Fig. 3 that χ p can indeed be significantly underestimated, especially if the precession modulations are suppressed when the binary is viewed nearly face-on or face-off.
Extrinsic parameters are in general less affected by waveform systematics and we do not expect their measurement errors to have a big impact. Sky location parameters enter in the detector pattern functions and should not be affected. We expect luminosity distance measurements to be affected mainly through their corre-lation with the binary's inclination angle. The latter can be better measured [121,122,[127][128][129] when the waveform includes higher harmonics beyond the dominant (2, ±2) modes. Amplitude errors should play a lesser role than phase errors, which could lead to us to misestimate M and thus bias the recovered distance. If we misestimate M due to phase errors, that will also bias the recovered distance. Through this correlation misestimation of distance can lead to additional bias on the source-frame masses. This can be significant for very distant binaries. Finally, based on the discussion in IV B we expect that for population analyses parameters characterizing the mass and spin distributions will be affected to some degree since the events making up the population will suffer some amount of parameter biases.
How can waveform models be improved and made ready for the planned future 3G detectors, such as Cosmic Explorer and Einstein telescope? On the one hand the accuracy in the inspiral regime needs to be improved. This requires a higher order and more complete PN description and further work on re-summation and effective-one-body theory to extend the validity of the inspiral to higher frequencies. The inclusion of self force terms into effective-one-body (EOB) could help accuracy for large mass ratios [130]. Post-Minkowskian results obtained with modern scattering amplitude methods could be useful to improve the accuracy, if pushed to higher order [131,132]. PN calculations have being made at 4PN order for non-spinning BBHs [133][134][135]. Practical semianalytic inspiral-merger-ringdown waveform models for BBHs, whether they are phenomenological or EOB models, require more NR waveforms covering larger parts of the binary parameter space and ultimately higher NR accuracy. Especially for unequal mass-ratios we will also require longer NR simulations in time in order to be able to combine NR waveforms with PN or EOB inspirals to form highly accurate hybrid waveforms [55][56][57], and to better determine inspiral coefficients in the construction of EOB models [61]. So far semi-analytic models have been tuned only in the non-precessing sector. Extending calibration as more precessing NR simulations are becoming available will be essential to improve their accuracy. In addition, a novel NR-independent, analytical approach for modeling the merger has been put forward [136]. The accuracy of this approach beyond NR accuracy could be assessed with constraints on waveforms obtained from balance laws at future null infinity [137]. Surrogate and reduced order models of NR waveforms [74,[138][139][140][141] and of EOB waveform models [61,67,68,142,143] have come to prominence in the past several years. They preserve the accuracy of the training set waveforms they are constructed from and are orders of magnitude faster to evaluate making them crucial for data analysis applications. They depend on their input data and so their accuracy is limited by the accuracy of the training set waveforms, and the requirement that the training data is sufficiently dense in the parameter space, since they need to fit or interpolate waveform coefficients over parameter space.
Waveform models should also include all physical effects that will leave a measurable trace in the emitted GW signal. This includes spin effects (aligned and precessing spins), higher harmonics beyond the dominant (2, ±2) modes in the waveform, imprints of eccentricity, and tidal effects if the binary contains at least one neutron star. As the number of waveform parameters increases it becomes harder to carry out enough NR simulations to accurately tune models. A further desirable improvement for waveform models is to also model internal errors in waveform models and marginalize over these parameters in PE, which can be achieved with Gaussian process regression (GPR) [144][145][146][147][148][149]. Posterior distributions obtained with such models should be more accurate (reduced bias) but somewhat less precise.
We have seen that NR waveforms are central for IMR waveform modeling. According to our results, NR waveforms will have to be improved in the future along three different dimensions: First, accuracy; second, length; third improved parameter space coverage. It turns out that each of these aspects will make simulations more expensive. Regarding length, the cost of an NR simulation is at least proportional to the time-to-merger; hence the cost will increase as 1/η (M Ω i ) −8/3 , so that starting at one half the initial (orbital) frequency M Ω i increases cost by at least a factor of 5. This scaling of the timeto-merger already indicates that making the mass-ratio more unequal will also increase the computational cost: The time-to-merger (and this computational cost) will increase at least as 1/η. In addition, current NR codes use explicit time integration and are therefore limited by the Courant condition, so that each time-step can cover at most a time-interval ∝ q (for q ≤ 1), giving a second power of the mass-ratio. Regarding accuracy, it is difficult to predict how the achieved accuracy scales with computational cost; one estimate for SpEC is that the cost goes as −1/3 [150], where is the NR error. Therefore, reducing the mismatch-error by a factor of 10 -at same parameters and length of the simulation -increases computational cost by about 50% for SpEC since the mismatch error goes as the square of the NR error [88,90,151]. Finally, both higher spins and higher mass-ratio make NR simulations more expensive, with the mass-ratio dependence most pronounced.
The accuracy and number of NR simulations have improved dramatically since the breakthrough in 2005 [93,152,153]. What improvements can we expect for the future that can deliver the simulations needed to be ready for 3G science? We can no longer rely on Moore's law to deliver massive improvements of CPU clock speeds. Instead advances in CPU development have shifted to increasing the number of cores and to exploit that NR codes need better parallelization and scaling. New codes are being developed to address these accuracy and performance issues. The SpECTRE code [154,155] from the SXS Collaboration uses task-based parallelism combined with the discontinuous Galerkin method to significantly increase the efficiency and scalability of relativistic astrophysics simulations. Work to significantly reduce computational cost for NR simulations is also under way for finite difference codes [156]. These approaches could lead to a two order of magnitude improvement in efficiency and bring us closer to solving the problems we have pointed out here. In addition to the truncation error which results from the finite degree polynomial approximations to continuum derivatives in Einstein's equations, errors are made when extracting the GW waveform on computer grids extending finite distances away from the merging binary. Traditionally, the waves are extracted (ideally on spherical shells) at several radii as far away from the origin as possible and the ideal waveform at future null infinity is extrapolated from that data. The Cauchy characteristic extraction method [157][158][159][160] can compute the emitted GWs with higher accuracy and should be available for future NR simulations. Combing waveforms from SpEC and finite difference codes by hybridization is a promising technique for especially challenging configurations [161].
Our study on the impact of waveform inaccuracies should be extended to tests of GR which we expect to be especially susceptible to systematic effects which could be misinterpreted as genuine deviations from GR. All of the current tests of GR [21] should be scrutinized. This includes tests on the distribution of the SNR of residuals in detector noise, testing whether the final mass and spin inferred from the low and high frequency parts of the GW signal are consistent, computing posterior distributions of deviations in e.g. PN waveform coefficients, computing posteriors on parameters in phenomenological dispersion relations and tests that put constraints on alternative GW polarizations. Ultimately, tests of GR should be  IMRPhenomPv2 MAP template strain hm(t; θMAP) for each detector in the HLVK network compared to the SNR of the respective signal strain. Left: Residual SNR as a function of signal (injection) SNR with fractional SNR indicated by color for the events in the population study. Right: Delta Log-likelihood d|h − 0.5 h|h at MAP against fractional SNR colored by the mass-ratio bias. The bias in mass-ratio is indicated by color. The event with the highest residual SNR is indicated by a red circle around the marker. The parameters for this event are given in Table VI. done by estimating parameters of waveform models for alternative theories of gravity, along with Bayesian model comparisons. Work is under way to identify well-posed alternative theories of gravity [162][163][164][165][166] and to numerically compute what the emitted GW will look like in the strong field regime [167].
Finally, we expect that LISA analyses of massive BBHs, which are should have SNRs of hundreds to thousands, will be affected in similar ways as demonstrated here for 3G ground-based detectors [22]. Updated estimates for current IMR waveform models will need to be explored in future studies.

ACKNOWLEDGMENTS
This work was stimulated by the Gravitational Wave International Committee (GWIC) 3G science-case study [168]. The  We construct hybrid waveforms by combining multimodal precessing PN and NR waveforms using the GWFrames [53] code. The code first reads the NR waveform data and transform it to the co-rotating frame [54] and shifts it in time so that the merger lies at t = 0. Data for the evolution of the positions, masses and spin vectors of the BHs as determined by locating their apparent horizons in the NR code is read in as well. Next, the separation vector between the two BHs and the orbital frequency are computed, along with the rotor of the reference frame at the relaxed time (after the junk radiation has passed).
We compute a PN waveform from the PNWaveform package included in the GWFrames code. The PN implementation includes nonspinning orbital binding up to 4pN [169]. The 5pN term is set to zero. Spin-orbit terms in the angular momentum are included up to 3.5pN [170].
Nonspinning flux terms are included up to 3.5PN [169], and higher-order terms from [171] up to 6 PN along with absorption terms from [172]. Spin-spin and spinorbit squared terms at 2PN order are included [173][174][175] and spin-orbit terms in the flux are included up to 4.0PN [176]. Precession of the orbital angular velocity and spins follows [170,173,177]. Expressions for waveform modes are taken from [178][179][180][181]. We use the SpinTaylorT1 and SpinTaylorT4 implemented in this code which are simply called TaylorT1 and TaylorT4 there, but we add the prefix Spin to make it clear that they support precession.
Initial data for the PN integration is set at the NR relaxed time and the PN equations are also evolved backwards in time to the desired starting orbital frequency M Ω i . The PN waveform is then transformed to the corotating frame. To prepare for hybridization, the PN and NR waveforms are aligned by minimizing the distance between their rotors in their co-rotating frames. The aligned waveforms are then blended and hybridized.
In this study we choose M Ω i = 0.002 due to computational restrictions. This corresponds to f GW ≈ 3.5 Hz for the (2, 2) mode. Higher ( , m) modes in the waveform enter the frequency band at m/2 the frequency at which the (2, 2) mode enters. Therefore, some of the higher harmonics are truncated at low frequencies but this effect is minor because they are very small compared to the dominant modes.
To use the LVC NR-injection infrastructure [182] we also hybridize dynamics quantities, namely the spin vectors, orbital frequency, the Newtonian orbital angular momentum vector, the vectorn pointing from one BH to the other, and the position vectors of the BHs. This allows us to define the spin vectors at a particular reference frequency and to output the result in "LVC NR" format.
Figs. 12 and 13 show selected waveform modes, the phase of the (2, 2) mode, the orbital angular momentum vector and the spin vectors for the two configurations used in this study. These plots demonstrate the good blending between the PN and NR data in the hybridization time region (gray shaded). The absolute value of inertial frame modes for the hybrids are shown in 14.
Higher harmonics are stronger for the more unequal mass SXS BBH 0104 configuration, whereas precession effects that give rise to modes like the (2, 1) and (3, 2) mode are stronger for SXS BBH 0308.
For SXS BBH 0308 there is a disagreement in the corotating frame (2,1) mode between PN and NR. This mode is weak since the system is almost equal mass which likely exacerbates the disagreement. In contrast, we find excellent agreement in the same mode for SXS BBH 0104. The effect of this discrepancy for SXS BBH 0308 is very small. The mismatch between the hybrid with and without the co-rotating frame (2, ±1) modes is on the order of hybridization error and NR error, about ∼ 10 −5 .
To study the error introduced by hybridizing PN and   Table VI), between a coherent (assuming the same incoming signal in all participating detectors) and an incoherent (where the signal in each detector is assumed independent) model [111]. Both the coherent and incoherent models are constructed from a superposition of Morlet-Gabor wavelets, where the number of wavelets is in itself a variable, as implemented in BayesWave [110]. This shows unequivocaly that although the modelled analysis, where a known GR waveform approximants attempts to match the signal that best matches what is observed across the detector network, there is a significant fraction of observable coherent signal left. The properties of this left-over signal are not strongly constrained by this analysis however, as expected by the typical SNR ∼ 12 for these residual signals. The excluded event, with properties listed in Table VI, has a Bayes factor of ∼ 4 × 10 11 . NR waveforms the optimal test would be to compute the mismatch of a hybrid against a very long high accuracy NR waveform that fills the detector band. Since this is in practise not possible we perform the following experiments to make sure that hybridization errors are subdominant. We compute overlaps between (a) a reference hybrid in the time window t = [200, 800]M , measured from the beginning of the NR waveform, and "sliding hybrids", a series of hybrids blended with 100M long time windows that approach the merger in discrete steps. We also compute (b) overlaps between sliding hybrids for the same window starting time between the SpinTaylorT1 and SpinTaylorT4 PN approximants. The reference hybrids are used as a signal waveforms for PE in the main study of the paper. We compute the max-max overlaps as defined in App. B of [70]. We show the resulting overlaps for SXS BBH 0308 and SXS BBH 0104 in Fig. 15. Both curves show that if one hybridizes early the mismatch is small and noisy. These mismatches are lower than the mismatches between different NR resolutions quoted in Sec. II B. Therefore hybridization errors are subdominant for these configurations. For SXS BBH 0308 the mismatch only rises beyond 10 −4 for windows that start within 500M of the merger, while for SXS BBH 0104 the mismatches approach 10 −3 already 1000M before merger. In this regime PN waveforms become inaccurate compared to NR and differences between PN approximants grow.
We also want to briefly mention additional sources of errors. Spin vectors are defined differently in PN and NR [183][184][185] and therefore, using the same spin values for both waveforms at the same time as we do in the hybrid construction will introduce an additional error that we do not quantify here. The m = 0 "memory" modes may not be accurate without using Cauchy characteristic extraction (CCE) [158]. The waveforms used in this study, SXS BBH 0308 and SXS BBH 0104 do not use CCE.
The configurations considered in this study are fairly easy to hybridize and one should not infer a general behavior of hybridization errors from them. For more challenging configurations (higher mass-ratios and spins) PN and NR are expected to show discrepancies further away from merger. How long NR waveforms need to be so that hybridization errors are subdominant requires detailed study [55][56][57][58].    Here, hybrids are constructed from an orbital frequency of M Ω = 0.01 and the overlap integral starts at 10 Hz and uses the aLIGO design PSD. (Therefore, higher harmonics will be incomplete in the frequency band, but in a consistent manner. Degradation of the mismatch will come from high frequencies close to merger.) The blue curves show mismatches between SpinTaylorT1-NR hybrids constructed with a 100M hybridization window as a function of the start time t1 of this window before merger against a reference hybrid with a broader window at t = [200, 800]M measured from the beginning of the NR waveform, and after the relaxation time. The binaries merge at 2822.24M and 5196.2M from the beginning of the NR simulations, respectively for SXS BBH 0308 and SXS BBH 0104. The orange curves show mismatches between SpinTaylorT1-NR and SpinTaylorT4-NR hybrids constructed with the same 100M hybridization window as a function of the start time t1.