Measuring binary black hole orbital-plane spin orientations

Binary black hole spins are among the key observables for gravitational wave astronomy. Among the spin parameters, their orientations within the orbital plane, $\phi_1$, $\phi_2$ and $\Delta \phi=\phi_1-\phi_2$, are critical for understanding the prevalence of the spin-orbit resonances and merger recoils in binary black holes. Unfortunately, these angles are particularly hard to measure using current detectors, LIGO and Virgo. Because the spin directions are not constant for precessing binaries, the traditional approach is to measure the spin components at some reference stage in the waveform evolution, typically the point at which the frequency of the detected signal reaches 20 Hz. However, we find that this is a poor choice for the orbital-plane spin angle measurements. Instead, we propose measuring the spins at a fixed dimensionless time or frequency near the merger. This leads to significantly improved measurements for $\phi_1$ and $\phi_2$ for several gravitational wave events. Furthermore, using numerical relativity injections, we demonstrate that $\Delta \phi$ will also be better measured near the merger for louder signals expected in the future. Finally, we show that numerical relativity surrogate models are key for reliably measuring the orbital-plane spin orientations, even at moderate signal-to-noise ratios like $\sim 30-45$.

Binary black hole spins are among the key observables for gravitational wave astronomy. Among the spin parameters, their orientations within the orbital plane, φ1, φ2 and ∆φ = φ1 − φ2, are critical for understanding the prevalence of the spin-orbit resonances and merger recoils in binary black holes. Unfortunately, these angles are particularly hard to measure using current detectors, LIGO and Virgo. Because the spin directions are not constant for precessing binaries, the traditional approach is to measure the spin components at some reference stage in the waveform evolution, typically the point at which the frequency of the detected signal reaches 20 Hz. However, we find that this is a poor choice for the orbital-plane spin angle measurements. Instead, we propose measuring the spins at a fixed dimensionless time or frequency near the merger. This leads to significantly improved measurements for φ1 and φ2 for several gravitational wave events. Furthermore, using numerical relativity injections, we demonstrate that ∆φ will also be better measured near the merger for louder signals expected in the future. Finally, we show that numerical relativity surrogate models are key for reliably measuring the orbital-plane spin orientations, even at moderate signal-to-noise ratios like ∼ 30 − 45.

I. INTRODUCTION.
Binary black hole (BH) spins leave characteristic imprints on the gravitational-wave (GW) signals observed by LIGO [1] and Virgo [2]. Measuring the spin parameters (illustrated in Fig. 1) from these signals will allow us to identify which astrophysical processes play a role in the binary evolution. For example, if the spins are tilted with respect to the orbital angular momentum, spin-orbit and spin-spin coupling cause both the spins and the orbital plane to precess [3,4]. On the other hand, the orbitalplane spin orientations, φ 1 , φ 2 and ∆φ = φ 1 − φ 2 , can be used to identify spin-orbit resonances [5] and infer merger kick velocities [6].
Unfortunately, measuring the individual spin degrees of freedom from GW events is challenging at current detector sensitivities. This is particularly true for the orbital-plane spin angles φ 1 , φ 2 and ∆φ [7][8][9] (although see Refs. [10][11][12]). For instance, these measurements are typically not shown in LIGO-Virgo Collaboration (LVC) publications (e.g. Ref. [13]) as they are very poorly constrained. In this paper, we show that this can be significantly improved by a simple change in the reference point at which the spins are measured.
Because the spin directions are not constant for precessing binaries, spin measurements are inherently tied to a specific moment in the binary's evolution. In practice, the spins are measured by varying the spin parameters of a GW model at a given reference point in the inspiral and matching the predicted signal to the observed data (cf. Sec. II). The traditional approach is to measure the spins at the point where the frequency of the GW signal at the detector reaches a prespecified reference value, typically f ref = 20 Hz [13]. This is mainly motivated by the fact that the sensitivity band of current detectors begins near this value [1,2].  GW150914  GW170729  GW170809  GW170818  GW170823  GW190413_052954  GW190413_134308  GW190421_213856  GW190424_180648  GW190503_185404  GW190513_205428  GW190514_065416  GW190517_055101  GW190519_153544  GW190521  GW190521_074359  GW190527_092055  GW190602_175927  GW190620_030421  GW190630_185205  GW190701_203306  GW190706_222641  GW190719_215514  GW190727_060333  GW190731_140936  GW190803_022701  GW190828_063405  GW190909_114149  GW190910_112807  GW190915_235702  GW190929_012149   Table I. The 31 GW events for which we use the NRSur7dq4 model.
We propose a different approach: instead of measuring the spins at a given signal frequency, we can measure them at a reference point near the merger. This can be achieved for all binaries by measuring spins at either (i) the point where the GW frequency reaches a fixed dimensionless frequency, M f ref = M f ISCO = 6 −3/2 /π, or (ii) a fixed dimensionless time, t ref /M = −100 before the GW amplitude reaches its peak, as defined in Eq. (5) of Ref. [15]. Throughout this paper, we use geometric units with G = c = 1, and masses refer to the redshifted, detector-frame values. Also, here M = m 1 + m 2 is the total mass of the binary with component masses m 1 ≥ m 2 , and ISCO stands for the Schwarzschild innermost stable circular orbit [16]. Although the ISCO is only well-defined in the point-particle limit, we follow previous literature (e.g., Ref. [17]) in defining the binary's ISCO frequency to be that of an isolated Schwarzschild BH with mass equal to the total binary mass M . This reference point typically occurs within ∼ 1 − 4 GW cycles before the peak amplitude. Similarly, the reference point of t ref /M = −100 typically falls within ∼ 2 − 4 GW cycles before the peak amplitude. Therefore, independent of the binary parameters, both of these choices allow us to measure the spins near the merger.
Because the spins evolve deterministically, all choices of reference point lead to the same waveform prediction if correctly specified (i.e., the GW model is also evaluated using spins evolved to that reference point). Nevertheless, not all reference points are equivalent for spin measurements in practice. There are two considerations to take into account when choosing a reference point: (1) if the reference point falls well outside the sensitive window of the detector, parameter inference can become inefficient [18]; additionally, (2) the waveform itself can be more (less) sensitive to variations in the spin parameters at some reference point, leading to more (less) precise constraints on the spins at those reference points. In particular, the key finding of this paper is that choosing a reference point near the merger leads to improved constraints on the orbital-plane spin angles. We show that this is due to the waveform being more sensitive to parameterspace variations in these angles near the merger (cf. Sec. III B).
While the traditional choice of f ref = 20 Hz accounts for consideration (1) above, it is not optimal when it comes to consideration (2)-we find that this makes it a poor choice for measuring the orbital-plane spin angles. On the other hand, the reference points that we propose, t ref /M = −100 and M f ref = 6 −3/2 /π, satisfy both criteria, and so they provide improved spin measurements. Furthermore, since binary BHs observed by LIGO-Virgo are expected to always merge within the instruments' sensitivity band, An executive summary of this paper is as follows. We find that measuring spins near the merger (at either M f ref = 6 −3/2 /π or t ref /M = −100) leads to a marked improvement in the constraints of φ 1 and φ 2 (but not ∆φ) for several GW signals in the latest GWTC-2 catalog of events [13,[19][20][21][22] released by the LVC. Furthermore, we use numerical relativity (NR) injections to demonstrate that all three angles, including ∆φ, will be better measured near the merger for louder signals expected in the future. Finally, we study how well different waveform models are able to recover the orbital-plane spin angles from NR injections, and show that NR surrogate models alone are accurate enough to reliably measure these angles, even for moderate signal-to-noise ratios (SNRs) like ∼ 30 − 45.
The improvement in the constraints obtained by measuring the spins near merger does not reflect a gain of new information about the source, but rather that the waveform is more sensitive to variations in the orbital-plane spin angles at this point. For instance, we find that evolving the spins measured at f ref = 20 Hz to the reference point This also means that our method can be applied entirely in post-processing. Even though no new information is extracted from the data, our improved constraints can have important implications, providing a better representation of the measurement. For example, in a companion paper Varma et al. [23], we use the GWTC-2 spin measurements form this work to constrain the astrophysical distributions of the orbital-plane spin angles at t ref /M = −100. Some of the features found in Ref. [23], such as an unexpected peak in the φ 1 distribution, are only resolvable when the spins are measured near the merger.
The rest of the paper is organized as follows. We describe our parameter estimation setup in Sec. II. In Sec. III, we discuss the orbital-plane spin angle measurements for GWTC-2 events. In Sec. IV, we describe our NR injection study for louder signals as well as comparison of different waveform models. Finally, in Sec. V, we provide concluding remarks.

II. PARAMETER ESTIMATION SETUP
We obtain measurements of binary parameters from GW signals using Bayes' theorem (see Ref. [24] for a review): where p(λ|d) is the posterior probability distribution of the binary parameters λ given the observed data d, L(d|λ) is the likelihood of the data given λ, and π(λ) is the prior probability distribution for λ. For quasicircular binary BHs, the full set of binary parameters λ is 15 dimensional [13], and includes the masses and spins of the component BHs as well as extrinsic properties such as the distance and sky location. Under the assumption of Gaussian detector noise, the likelihood L(d|λ) can be evaluated for any λ using a gravitational waveform model and the observed data stream d [24]. A stochastic sampling algorithm is then used to draw posterior samples for λ from p(λ|d).
Our main results are obtained using the time-domain NR surrogate waveform model NRSur7dq4 [15]. This model accurately reproduces precessing NR simulations and is currently the most accurate model in its regime of validity [15]. NRSur7dq4 is trained on generically precessing NR simulations with mass ratios q ≤ 4 and spin magnitudes χ 1 , χ 2 ≤ 0.8, but can be extrapolated to q = 6 and χ 1 , χ 2 ≤ 1 [15]. Wherever comparison with NR is possible in the extrapolated region, NRSur7dq4 performs better than alternate models [15,25].
In addition to predicting the waveform, NRSur7dq4 also predicts the spin and orbital dynamics by numerically solving a set of ordinary differential equations (ODEs) [15]. The ODE integration can be initialized at any reference point in the inspiral. The model then evolves the component spins (and orbital dynamics) both forwards and backwards in time, and uses the evolved spins for its internal fits. During the inspiral, the ODE is informed by NR spins and dynamics. However, once the two BHs merge, the individual BH spins are no longer available in NR [29]. Therefore, starting at a time t/M = −100 before the peak amplitude, NRSur7dq4 switches to post-Newtonian-inspired equations to evolve the individual BH spins past the merger-ringdown stage [15,30]. Here, the choice of t/M = −100 is arbitrary, but once again, designed to be near the merger. The spins extended past t/M = −100 are not meant to be physical, but rather a convenient parameterization for the NRSur7dq4 internal fits in the merger-ringdown [15]. NRSur7dq4 also allows this, we find that for some GW events, the ISCO is reached at a time after t ref /M = −100, which can result in unphysical spins. Therefore, while we provide some results at M f ref = 6 −3/2 /π to demonstrate its efficacy, we will use spin measurements at t ref /M = −100 for our main results.

III. ORBITAL-PLANE SPIN ANGLE MEASUREMENTS
The GWTC-2 catalog [13,19] includes a total of 46 binary BH events. However, because NRSur7dq4 only includes ∼20 orbits before merger, it can only be applied to events with M 60 M [15] (for a detector start frequency of 20Hz). This reduces the set of events to 31; these are listed in Tab. I. All results in this section are obtained using NRSur7dq4 for these events, which we will refer to as the "NRSur7dq4 events" for convenience. We provide some results for all 46 events using the IMRPhenomTPHM model [31] in App. B.

A. Spin measurements for GW events
We first consider GW170818 [19], the event for which we see the greatest improvement when the spins are measured near the merger. Figure 2 shows the φ 1 , φ 2 and ∆φ measurements for GW170818 at the three different reference points, t ref /M = −100, M f ref = 6 −3/2 /π and f ref = 20 Hz. Here, for the joint 2D posteriors, we show 70% contours instead of the more commonly used 90% and 50% contours [13], as we find the 70% contours represent the bulk of the probability mass while being more instructive to discuss the correlations below.
First considering the marginalized 1D distributions in   than those at f ref = 20 Hz. Note that φ 1 and φ 2 peak in different regions for the three different reference points, while ∆φ is consistent for each of them. This is because φ 1 and φ 2 change on the orbital timescale, as they are defined with respect to the line-of-separation (cf. Fig. 1). On the other hand, ∆φ changes only on the precession timescale, which is longer. Furthermore, while the peaks of φ 1 and φ 2 are approximately π apart, this does not result in a ∆φ peak near ±π. Instead, the ∆φ posterior is much broader, with a mild peak near 0. This suggests that even for spins measured near the merger (t ref /M = −100 or M f ref = 6 −3/2 /π), the data is only informative about φ 1 or φ 2 , but not necessarily both at the same time. In fact, we do not see any significant improvement in the 1D ∆φ posterior near the merger for this event.
However, examining the 2D distributions in Fig. 2, we find that the posteriors for all three combinations of φ 1 , φ 2 and ∆φ are generally better constrained at Hz. The 2D posteriors also show a significant amount of correlation between the three angles. The main feature here is that φ 1 and φ 2 are better measured than ∆φ, as noted above. Therefore, the correlations are along vertical and horizontal directions in the φ 1 − φ 2 posterior, while they are along diagonal directions for the φ 1 − ∆φ posterior (and to a lesser extent for the φ 2 −∆φ posterior). We note that similar correlations are absent for the higher SNR injections shown in Sec. IV. This suggests that the degeneracies we see in the 2D posteriors of Fig. 2 are a function of the SNR, and can be broken for louder signals.
In the rest of this section, we will focus on 1D marginalized posteriors at t ref /M = −100 and f ref = 20 Hz for simplicity. While GW170818 shows the biggest improvement in φ 1 and φ 2 when measured at t ref /M = −100, we find that several other events in GWTC-2 also show significant improvements. Figure 3 compares marginalized    [32] is the only event with a good measurement at f ref = 20 Hz. This is explained by the fact that this binary merges at a low frequency due to its high mass (M ∼ 270M ), therefore its f ISCO ∼ 16 Hz happens to be close to 20 Hz. However, as we will show in Sec. IV, we expect this to change with louder signals.
An unambiguous measurement of the orbital-plane spin angles relies on being able to constrain the spin magnitudes away from zero, and the tilt angles to be neither 0 nor π. For the NRSur7dq4 events, our measurements of the spin magnitudes and tilts are consistent with Refs. [13,19], and are shown in App. A. Most of these events are consistent with having zero spin magnitudes for both BHs [13], but there is evidence for nonzero spin magnitude in at least some of the events [33]. Secondly, even though there is evidence of precession in the astrophysical binary BH population [34], the individual events are not loud enough to show clear evidence of precession on their own [13]. Nevertheless, these measurements still allow us to place interesting constraints on the astrophysical distributions of the orbital-plane spin angles. This is explored in a companion paper, Ref. [23].

B. Varying the reference point
Next, we systematically study the impact of the reference point at which the spins are measured. Figure 5 shows φ 1 measurements at various different reference times for the 31 NRSur7dq4 events. Rather than repeat the parameter estimation at each t ref , we use the NRSur7dq4 spin dynamics [15] to evolve the spins backwards from t ref /M = −100 to the earlier times. As noted in the introduction, we find that this leads to results consistent with measuring spins directly at the new reference time. The earliest reference time we consider is t ref /M = −4000, which is near the start of the NRSur7dq4 waveform's validity [15].
In Fig. 5, as we move the reference point from t ref /M = −4000 to t ref /M = −100, we see a clear improvement in the φ 1 constraint for several events. A possible explanation for this improvement is that the precession (and orbital) timescale decreases as the merger approaches, making the waveform more sensitive to the orbital-plane spin angles near the merger as a result. We provide further justification for this with a Fisher matrix analysis in the following.

Fisher matrix analysis
As a proxy for the sensitivity of the waveform to the orbital-plane spin angles, we can look to the Fisher information matrix. The Fisher matrix provides a simple way to estimate the statistical uncertainty in measuring binary BH parameters in the high-SNR limit [35,36]; it is defined as where h(t) is the gravitational waveform with binary parameters λ = {λ i } (cf. Sec II), and the inner product (h|g) is defined as whereh(f ) indicates the Fourier transform of h(t), * stands for complex conjugation, and S n (f ) is the onesided power spectral density for which we use the LIGO design sensitivity noise curve [37]. We use the NRSur7dq4 waveform model for h(t) and compute the derivatives in Eq.
(2) numerically [38]. Then, using the Cramer-Rao inequality [39,40], the measurement covariance matrix Var(λ i , λ j ) satisfies Finally, taking the lower bound of the inequality, the statistical uncertainty in λ j can be estimated as and the correlation coefficient between λ i and λ j can be estimated as, The Fisher matrix method reliably estimates the statistical uncertainty only in the limit of high SNR (see, e.g., Ref. [41] for caveats). Regardless, here we are not interested in the statistical uncertainty itself but in quantifying the sensitivity of the waveform to variations in the binary parameters. The Fisher matrix method is wellsuited for this purpose, as we use its bound on statistical uncertainties merely as a proxy for waveform sensitivity: smaller δλ j indicates that the waveform is more sensitive to λ j .
In particular, we are interested in how sensitive the waveform is to changes in the orbital-plane spin angles     at various values of t ref .
For this purpose, we generate the waveform corresponding to the maximum likelihood parameters for each of the 31 NRSur7dq4 events. Then, we compute Corr(φ 1 , φ 2 ) by varying the spins for the same waveform at different t ref . We initially compute the statistical uncertainties at a fixed distance of 100 Mpc, but then rescale them following δλ j ∝ 1/SNR to correspond to an SNR (defined as (h|h) ) matching that of the observed event. Figure 6 shows the statistical uncertainties in φ 1 and φ 2 as we move from t ref /M = −4000 to t ref /M = −100. In almost all cases, we see that the statistical uncertainty decreases as we approach the merger, meaning that the waveform is generally more sensitive to variations in φ 1 and φ 2 near the merger. This explains the improved measurement at t ref /M = −100 in Fig. 2 and Fig. 3, as well as the systematic improvement as we approach t ref /M = −100 in Fig. 5.
To summarize, since the observed waveform is most sensitive to the orbital-plane spin angles near merger, the data can successfully constrain these angles at that point. However, the precision of that measurement is not preserved as we extrapolate the spins back in time because, even though the dynamics are deterministic, this detail gets smeared out during the inspiral cycles.
Finally, we note that the direction of the δφ 1 − δφ 2 correlations in Fig. 6 depend on where in the evolution they are evaluated. This is in agreement with Ref. [38], which found that the inspiral and ringdown regions of the waveform carry complimentary information.

IV. NR INJECTION STUDY
To further investigate the measurability of the orbitalplane spin angles, we consider four NR waveforms, SXS:BBH:0139, SXS:BBH:0143, SXS:BBH:0632, and SXS:BBH:0633, from the public SXS catalog [29,42,43]. These waveforms correspond to systems with mass ratios q < 2 and substantial orbital-plane spins. Note that none of these waveforms were used to train NRSur7dq4. We choose a total mass M = 70M , an inclination angle ι = π/3 between L and the line-of-sight, and a reference orbital phase φ ref = π/3. Note that ι and φ ref are defined at f ref = 20 Hz. The luminosity distance is chosen such that the network matched-filter SNR is either 30 or 45. The rest of the binary parameters will be shown in figure insets below. We inject these NR waveforms (in zeronoise) into a simulated LIGO-Virgo network operating at design sensitivity [37], and recover them using different waveform models. The injection and parameter inference are done using the Parallel Bilby [26] package. In Sec. III, we showed that the constraints on the orbital-plane spin angles become tighter when measured near the merger. It is important to verify that this tighter constraint does not lead to biased estimates for these angles. We verify this in Fig. 7 Measuring the full orbital-plane spin degrees of freedom is necessary to constrain the kick population as done in Ref. [23].

B. Waveform model comparison
In this section, we study the performance of different waveform models in recovering the orbital-plane spins angles. Apart from NRSur7dq4, we also consider the phenomenological models IMRPhenomTPHM [31] and IMRPhenomXPHM [44], as well as the effective-onebody model SEOBNRv4PHM [45]. While these models also include some effects of precession, they are not calibrated on precessing NR simulations. Note that IMRPhenomTPHM and SEOBNRv4PHM are time-domain models, while IMRPhenomXPHM is a frequency-domain model. We only consider spin measurements at f ref = 20 Hz for these models as specifying spins at a dimensionless time/frequency would require careful modifications to how these models are implemented. However, because binary BH spin evolution is deterministic, any biases seen at f ref = 20 Hz should translate to biases at t ref /M = −100 as well. We repeat our NR injections at SNRs of 30 and 45, but because SEOBNRv4PHM is significantly more expensive than the other models, we only apply it to the injections at SNR=45. Figure 9 shows φ 1 , φ 2 and ∆φ posteriors obtained using   Figure 8. Same as Fig. 7, but now the SNR is increased to 45. ular, there are cases (e.g. φ 1 in top-right panel and ∆φ in the bottom-left panel of Fig. 9) where this model has the strongest peaks in the 1D distributions, but prefers the wrong value. By contrast, for both NRSur7dq4 and IMRPhenomTPHM, the true value is always within the 90% credible region of the 2D posteriors.
However, for SXS:BBH:0139 (bottom-left of Fig. 9), the 1D ∆φ posterior for IMRPhenomTPHM is peaked away from the true value, even though the true value is included in the 90% credible region of the 2D posteriors. For NRSur7dq4, the peak in the 1D ∆φ posterior is much broader in this case and includes the true value. To check whether this is indicative of a systematic bias in IMRPhenomTPHM, we consider the 50% credible region of the 2D posteriors for this case, which is shown only for IMRPhenomTPHM for simplicity. The 50% credible region clearly excludes the true value for IMRPhenomTPHM, meaning that the bulk of the probability density for IMRPhenomTPHM is concentrated in a region away from the true value. This suggests that the deviation in the 1D ∆φ posterior for IMRPhenomTPHM is indeed due to a systematic bias. This serves as an another example where  a waveform model has a stronger peak than NRSur7dq4 in the 1D posterior, but is peaked at the wrong value. Similarly, for SXS:BBH:0143 (bottom-right of Fig. 9), the true value is included in the 2D posteriors but the 1D φ 1 posteriors appear biased for both IMRPhenomTPHM and NRSur7dq4. However, in this case the primary BH has negligible spin in the orbital-plane, therefore φ 1 is not a meaningful parameter and hence the offset from the true value is not of concern.
We repeat these NR injections at SNR=45 in Fig. 10, now also including the SEOBNRv4PHM model. We now find that the true value is fully excluded from the 90% credible region of the 2D posteriors for IMRPhenomXPHM for three out of our four injections. While IMRPhenomTPHM still performs better than IMRPhenomXPHM, IMRPhenomTPHM also excludes the true value from the 90% credible region of the 2D posteriors for SXS:BBH:0143 (bottom-right of   Figure 10. Same as Fig. 9, but now the SNR is increased to 45, and only 90% contours are shown for all joint posteriors. the 1D ∆φ posterior is still biased for IMRPhenomTPHM, while NRSur7dq4 now has a clear peak around the true value. This suggests that NRSur7dq4 is not prone to the systematic biases present in IMRPhenomTPHM (and IMRPhenomXPHM) as noted above.
In Fig. 10, SEOBNRv4PHM is comparable to IMRPhenomTPHM, including the true value in the 90% region of the 2D posteriors for three out of four cases, with the exception being SXS:BBH:0139 (bottom-left of Fig. 10). For this case, SEOBNRv4PHM also shows similar biases in the ∆φ 1D posterior as IMRPhenomTPHM (and IMRPhenomXPHM), meaning that this model is also prone to the systematic biases noted above. Furthermore, for SXS:BBH:0632 (top-left of Fig. 10), the 1D posteriors for φ 1 and φ 2 for SEOBNRv4PHM are biased (although they are somewhat included in the smaller secondary modes).
Finally, we note that NRSur7dq4 generally leads to the best constraints in Figs. 9 and 10, which is expected as this is the only model trained on precessing NR simulations (but not the ones injected here). Instead, the IMRPhenomTPHM, IMRPhenomXPHM and SEOBNRv4PHM models approximate orbital precession by "twisting" [31,44,45] a corresponding nonprecessing waveform. While this captures the leading effect of precession, it does not account for effects such as asymmetries between pairs of ( , m) and ( , −m) spin-weighted spherical harmonic waveform modes [15]. Missing physics like this can lead to the systematic biases we see in Figs. 9 and 10. Among the phenomenological models, IMRPhenomXPHM is known to have a less accurate precession treatment than IMRPhenomTPHM [31], which could be responsible for IMRPhenomXPHM having the largest biases in our tests.

V. CONCLUSIONS
We propose that binary BH spins be measured at a reference point close to the merger, at either a dimensionless GW frequency M f ref = M f ISCO = 6 −3/2 /π or a dimensionless time t ref /M = −100 before the peak of the GW amplitude. We demonstrate that this leads to significant improvements in the measurement of orbital-plane spin orientations φ 1 and φ 2 for various events in the GWTC-2 catalog, while ∆φ is not significantly impacted. However, using NR injections, we show that ∆φ will also be better measured near the merger for louder signals expected in the future.
Using the same NR injections, we compare the performance of the waveform models NRSur7dq4, IMRPhenomXPHM, IMRPhenomTPHM, and SEOBNRv4PHM, in recovering φ 1 , φ 2 and ∆φ at f ref = 20 Hz. As expected, NRSur7dq4 provides the most accurate constraints for these angles, as this is the only model informed by precessing NR simulations. Among the other models, in general, we find that IMRPhenomTPHM and SEOBNRv4PHM perform better than IMRPhenomXPHM. However, even at moderate SNRs (∼ 30−45), we find examples where these models have biased estimates. This highlights the need to train waveform models on precessing NR simulations in order to reliably extract the full spin information from binary BH signals. IMRPhenomXPHM, IMRPhenomTPHM, and SEOBNRv4PHM do not currently allow specifying the spins at t ref /M = −100 or M f ref = 6 −3/2 /π, but we expect these biases will persist at those reference points.
In a companion paper, Ref. [23], we use the improved spin measurements obtained here to constrain the astrophysical distribution of the orbital-plane spin angles as well as merger kicks for the binary BH population. Notably, we find a preference for ∆φ ∼ ±π in the population, which can be a signature of spin-orbit resonances [5].

ACKNOWLEDGMENTS.
We thank Rory Smith and Avi Vajpeyi for support with the Parallel Bilby [26] package, and Hector Estelles, Sascha Husa, Geraint Pratten, and Marta Colleoni for support with the phenomenological waveforms. We thank Sizheng Ma for sharing his Fisher matrix code. We thank Davide Gerosa  Hz. We show the spin magnitudes χ 1,2 , cosines of the tilt angles cos θ 1,2 , and orbital-plane spin angles φ 1 , φ 2 , and ∆φ. The priors on all of these parameters are flat in their respective ranges (cf. Sec. II). The spin magnitude and tilt posteriors are consistent between the two reference points, and are consistent with Ref. [13].
In the following, we refer exclusively to the spin measurements at t ref /M = −100, and check whether measurements of the orbital-plane spin angles can be tied to a measurement of χ 1,2 and cos θ 1,2 . As noted in Sec. III A, an unambiguous measurement of the orbital-plane spin angles relies on being able to constrain χ 1,2 away from zero, and cos θ 1,2 away from ±1. GW190521 (Fig. 13) is a good example of this: there is a clear preference for large χ 1,2 and cos θ 1,2 ∼ 0, which likely enables the φ 1 and φ 2 measurement for this event. On the other hand, for GW190517_055101 (Fig. 13), there is a preference for large χ 1,2 , but with cos θ 1,2 ∼ 1. However, we still see peaks in the φ 1 and φ 2 distributions. Finally, for GW170818 (Fig. 11), there is a mild preference for small χ 1 and a similarly mild preference for large χ 2 , but this is the event with the best φ 1 and φ 2 measurement (cf. Fig. 3). We conclude that current constraints on the spin magnitudes and tilts are too broad to look for such correlations with the orbital-plane spin angles: even for the cases where we see a preference for small χ 1,2 and/or cos θ 1,2 ∼ ±1, there is enough support for large χ 1,2 and cos θ 1,2 ∼ 0, that there can be peaks in the posteriors of the orbital-plane spin angles.

Appendix B: Results for all GWTC-2 events using IMRPhenomTPHM
The results in Sec. III were restricted to the 31 GWTC-2 with M 60M due to the length restrictions of NRSur7dq4. The remaining 15 events with M 60M are listed in Tab. II. For completeness, we now analyze all 46 GWTC-2 binary BH events using IMRPhenomTPHM. We choose IMRPhenomTPHM as this model performed better than IMRPhenomXPHM in Sec. IV. Once again, for simplicity, we only consider f ref = 20 Hz for IMRPhenomTPHM. For the 15 events with M 60M , we relax the prior constraints described in Sec. II to: 5 ≤ M ≤ 400, q ≤ 20, and 10 ≤ M ≤ 400. 60M   GW151012  GW151226  GW170104  GW170608  GW170814  GW190408_181802  GW190412  GW190512_180714  GW190707_093326  GW190708_232457  GW190720_000836  GW190728_064510  GW190828_065509 GW190924_021846 GW190930_133541   Fig. 4. Interestingly, for GW190521, IMRPhenomTPHM has a clear peak at ∆φ ∼ 0 which is absent for NRSur7dq4. This shows that waveform systematics are already important to consider for current GW events when measuring the orbital-plane spin angles. While further investigation is necessary to understand the nature of this peak, we note once again that IMRPhenomTPHM can have biases in 1D ∆φ distributions as shown in the bottom-left panels of Fig. 9 and Fig. 10. In these cases, the IMRPhenomTPHM ∆φ posterior is more sharply peaked (compared to NRSur7dq4) but is also biased.