A Unified $p_\mathrm{astro}$ for Gravitational Waves: Consistently Combining Information from Multiple Search Pipelines

Recent gravitational-wave transient catalogs have used \pastro{}, the probability that a gravitational-wave candidate is astrophysical, to select interesting candidates for further analysis. Unlike false alarm rates, which exclusively capture the statistics of the instrumental noise triggers, \pastro{} incorporates the rate at which triggers are generated by both astrophysical signals and instrumental noise in estimating the probability that a candidate is astrophysical. Multiple search pipelines can independently calculate \pastro{}, each employing a specific data reduction. While the range of \pastro{} results can help indicate the range of uncertainties in its calculation, it complicates interpretation and subsequent analyses. We develop a statistical formalism to calculate a \emph{unified} \pastro{} for gravitational-wave candidates, consistently accounting for triggers from all pipelines, thereby incorporating extra information about a signal that is not available with any one single pipeline. We demonstrate the properties of this method using a toy model and by application to the publicly available list of gravitational-wave candidates from the first half of the third LIGO-Virgo-KAGRA observing run. Adopting a unified \pastro{} for future catalogs would provide a simple and easy-to-interpret selection criterion that incorporates a more complete understanding of the strengths of the different search pipelines


I. INTRODUCTION
The detection of gravitational waves (GWs) [1,2] by the Laser Interferometer Gravitational-Wave Observatory (LIGO) [3] and Virgo [4] detectors is the culmination of decades of research.Not only are sensitive detectors needed to the measure the GW signals [5,6], but sophisticated data analysis is needed to distinguish astrophysical signals from detector noise [7].In the case of transient signals, such as those from compact binary coalescences (CBCs), detection algorithms identify candidate signals by matching template signals to the data [8] or by looking for coherent signals in multiple detectors [9].Only after identifying GW candidates in detector data can we start to understand the astrophysical population of GW sources.
A fundamental question in any data analysis problem is the veracity of the signal or the effect it is considering.The statistical significance of a transient GW is generally assessed by calculating how likely the data would appear due to noise fluctuations.This may be quantified by the false alarm rate (FAR) or by a p value.Statistics like FAR are particularly well suited for making a first detection, where we do not know the population of signals.However, a more complete assessment of the probability that a candidate is real can be obtained by considering not only the FAR but also the true alarm rate (i.e., how often the algorithm would find identify a real signal) whose calculation requires knowledge of the source population and our sensitivity to their signals, and hence is subject to additional uncertainties not inherent in the calculation of the FAR or p value.By combining the false and true alarm rates, we can calculate the probability of astrophysical origin p astro for a candidate [10,11].In the GW literature, p astro was first used to estimate the astrophysical probability of GW150914 and the then-considered marginal trigger LVT151012 [12,13] (now known as GW151012 [14]).The probability of astrophysical origin p astro directly addresses the key question of how probable a candidate is to be real accounting for both our understanding of detector sensitivity and the population of GW sources.
We are now in the era of having many GW candidates, such that we are building an understanding of the source population [15].GWs catalogs from both the LIGO-Virgo-KAGRA (LVK) Collaboration [2,14] and independent teams [16,17] now commonly use p astro as a criterion to identify interesting candidates for further analysis.By the end of their third observing run (O3), the LVK Collaboration have reported 90 CBC candidates with p astro > 0.5 [2,14,18,19].Despite the additional uncertainties inherent in its calculation, p astro is important to consider now that we now have observations from multiple observing runs of different sensitivity [6] and observations of different types of sources [1,2,20,21].At a given FAR, a candidate from the most recent O3 run [22,23] is more probable to be real than a candidate from the (less sensitive) first observing run [24].Similarly, at a given FAR, a candidate consistent with originating from a 30M ⊙ + 30M ⊙ binary black hole (BBH) merger is more probable to be real than a candidate consistent with a binary neutron star (BNS) origin [25] because the former can be detected from a greater distance and so are more frequently observed [2].Since p astro accounts for the true alarm rate, it can naturally be used when compiling a heterogeneous catalog of diverse sources from different observing runs.
Unfortunately, this complicates the use of p astro when interpreting current GW candidates in three ways.First, there are differences in the assumptions about the source population between analyses that mean that results are not directly comparable.Second, as illustrated in Fig. 6 of GWTC-3 [2], pipelines can have different relative sensitivities, which may provide valuable insights into whether the candidate is real.This information is not incorporated in individual p astro estimates.Third, it means there is no single assessment of the significance of a candidate, complicating the calculation of contamination and assessment of search sensitivity (and hence selection effects, which are essential for population studies).Indeed, as can be seen in Table 1 of GWTC-3 [2], pipelines can often find a different set of triggers over the same stretch of data and can sometimes calculate significantly different p astro values for the same candidate.To fully make use of the different search pipelines, it is necessary to combine their results to produce a single p astro value for each candidate.Some early work in combining results from multiple pipelines was done in the pre-detection era.For exam-ple, developing a way to estimate generalized frequentist upper limits with multiple pipelines [42], and developing unified detection statistics [43].In this paper, we present a framework that can combine the results from multiple detection algorithms to produce a unified p astro employing information about the correlation between pipelines.This means that catalogs can be produced consistently using a simple threshold on this statistic and that downstream users may more straightforwardly calculate the contamination fraction for such a catalog.This statistic is amenable to use in population inferences that incorporate low-significance candidates [44,45].Crucially, as the formalism accounts for the correlations between different pipeline outputs (or lack thereof), we can make use of the full ensemble of results when evaluating whether a candidate is real.We present an illustration of the properties of our p astro formalism, and provide a proof-of-concept example of its use with real GW data from the first half of O3 (henceforth called O3a) analyzed for GWTC-2.1 [19].
In Sec.II we define p astro from first principles and describe its connection with the Poisson mixture-model formalism, first developed by Farr et al. [10] (henceforth called the FGMC method) in the GW context.Section III extends this formalism to calculate unified p astro from multiple search pipelines.We demonstrate some of its useful properties by the means of a simple toy model in Sec.IV, followed by an illustrative application to triggers from GWTC-2.1 in Sec.V.This illustration highlights the tools that need to be developed in order to produce a reliable unified p astro for use in future GW catalogs.We discuss the applications and extensions of our formalism in Sec.VI.

II. DEFINING pastro
In this section, we present a pedagogical derivation for p astro as commonly seen in literature [10,12,13,25], connecting it the Poisson mixture-model FGMC formalism.In Sec.III, we explain how to extend this for the case of multiple search pipelines.
The primary goal of a search pipeline is to identify candidates of astrophysical origin.Pipelines usually accomplish this by maximizing some statistics over a small stretch of data, and identifying a particular candidate, which we will call a trigger.We assume that the stretch of data contains at most one signal.Every trigger has a detection statistic x associated with it by the search pipeline.
The distribution of the detection statistic under pure noise lets us calculate the FAR distribution of the pipeline.In the context of GW analysis, this can be done by the bootstrapping method of time slides, where the data stream from one detector is shifted with respect to another by more than the light-travel time between them [46], or by constructing a model of the noise using triggers that are not coincident between detectors [47].
We define p astro as the probability that a particular trigger is caused by an astrophysical signal, as opposed to a noise fluctuation or terrestrial contamination.Hence p astro is properly associated with the trigger and its detection statistic (one specific reduction of the data) rather than directly the underlying data itself.When different pipelines analyze the data, they make different analysis choices which give them differing sensitivities, and so it is possible that they yield different triggers and p astro values.
Since a trigger can be caused by an astrophysical signal or by noise, p astro depends on the posterior probability for these two hypotheses.We define p(S|x, Φ s , Φ n ) as the posterior probability for the signal hypothesis S, and p(∅|x, Φ n ) as the posterior probability for the noise hypothesis ∅.Both these probabilities are conditioned on the detection statistic of the trigger x, and assume some signal and noise parameters, Φ s and Φ n .We can then write p astro for the trigger as the normalized probability that it is astrophysical, We shall suppress the signal and noise parameters from here on for the purpose of clarity.A value p astro = 1 implies perfect confidence about the presence of a signal in an underlying segment of data, while p astro = 0 implies perfect confidence of its absence, and p astro = 0.5 implies no preference between the signal and noise hypothesis.The posterior probabilities for two cases are difficult to compute directly.What is more easily accessible is the likelihood of the trigger x, under the signal and noise hypotheses.Therefore using Bayes' theorem, we write p(S|x) and p(∅|x) as where p(x|S) and p(x|∅) are the likelihoods for getting a trigger with statistic x under the signal and noise hypothesis, respectively; π s and π n are the corresponding priors, and Z(d) is the Bayesian evidence.We can then write p astro as Suppose that we analyze a long stretch of data with a single search pipeline.Let R s and R n be the rates of astrophysical and noise-only triggers, respectively, assuming a given detector sensitivity, operating characteristics, and algorithmic choice (including any preliminary thresholds that are applied) made by the search pipeline.We define the signal and noise count parameters, Λ s and Λ n , respectively.These are the mean number of astrophysical and noise triggers (above the predefined thresholds) for an observing duration T (not necessarily the number of triggers in any particular realization of the data) such that The total number of triggers N will follow a Poisson distribution with a count parameter Λ = Λ s + Λ n : For computational reasons, it is common to require that the trigger has to pass some primary statistical threshold, before being used for further analysis.For example, the GWTC-2.1 [19] and GWTC-3 [2] analyses considered only triggers that had a FAR ≤ 2 day −1 .Our formalism does not make any strong assumptions about the preliminary threshold, as long as it is consistently used among all pipelines.Thresholds are incorporated by suitably defining the counts Λ s and Λ s as for triggers that pass the threshold.
The count parameters are generally assumed to be unknown and have to be measured empirically using the triggers.To do this we use the Poisson mixture-model FGMC formalism: where {x} denotes the set of all triggers found in the data.Conditioned upon Λ s and Λ n , the prior probabilities are just the relative rate of the triggers of that category, Therefore, Eq. ( 6) becomes Using Bayes' theorem, we can relate the posterior distribution on the count parameters with the the Poisson likelihood, where π(Λ s , Λ n ) is the prior on the mean count.Putting this together with Eq. ( 5), we obtain the posterior distributions for Λ s and Λ n , Finally, combining Eq. ( 3) and Eq. ( 7), we can calculate a p astro that marginalizes over the posterior of Λ s and Λ n , This fundamental framework has been used by all search pipelines both by the LVK Collaboration and other groups in compiling GW catalogs [2,12,13,16,17,19,[48][49][50].

III. TOWARDS A UNIFIED pastro
We now develop a way to combine information from multiple pipelines for calculating a unified p astro .Suppose that we have several search pipelines that yield triggers after running over the same underlying data.Consider any two corresponding triggers x α and x β from pipelines α and β; as these triggers are produced by different pipelines they may be associated with different statistics, but these statistics will be correlated depending upon the relative sensitivities of the two pipelines.Understanding the correlations between pipelines is key to understanding how to construct a unified p astro .
The correlations between different search pipelines are not typically accounted for when compiling GW results.In some analyses, the FAR is multiplied by a trials factor equal to the number of different searches [51,52].However, this is a conservative choice: it would be correct if the results were noise triggers that were uncorrelated, but in general, we would expect some correlation since search pipelines are searching for similar signals in the data.This correlation should reduce the effective trials factor (in the limit of running two identical pipelines there would be no need to add a trials factor).Our framework accounts for correlations in both noise triggers and signal triggers to construct a unified p astro .
Let us define ⃗ x = {x α , x β , ....} as the triggers that correspond with each other from a series of different pipelines.For the rest of this paper, a vector usually means a vector of pipeline outputs, such that {⃗ x} is the set of all triggers from all the pipelines in the stretch of data being analyzed, and a latin index is used to indicate the individual data segment or trigger.As in Eq. ( 2), the probability p(S|⃗ x) is given by, Similarly, for the noise hypothesis, Here, p(⃗ x|S) and p(⃗ x|∅) are joint likelihoods for obtaining ⃗ x = {x α , x β , ....} and are dependent on the aforementioned correlations between trigger statistics across pipelines.If we learn these joint likelihood distributions, using simulated signals and noise triggers, we can calculate a unified p astro .While several methods to learn arbitrary distributions exist, we will primarily use kernel density estimation (KDE) methods from scikit-learn [53] to learn p(⃗ x|S) and p(⃗ x|∅).The choice of method used to reconstruct the distribution is important, and the distributions must be properly characterized to successfully calculate a reliable p astro .However, for the purposes of illustrating the framework needed to compute a unified p astro , this method may be treated as a black box.
Once the distribution of these correlations is learned, one can modify Eq. ( 10) for a joint FGMC estimate Equation ( 11) can then be modified to calculate a unified p astro marginalized over signal and noise counts, This equation therefore incorporates correlations between pipelines through the joint likelihoods p(⃗ x|S) and p(⃗ x|∅).
The role of correlations in calculating p astro may be understood by considering a few idealized cases.If we had two pipelines that looked for the same type of signal, but had different sensitivities to noise triggers, we might expect to be more certain in a candidate (assign a higher p astro ) if there are corresponding triggers from both pipelines.However, if a candidate was identified by one pipeline, and not by another which has greater sensitivity to that type of signal, we might expect to suspect it as a false alarm (assign a lower p astro ).In the following subsections, we will demonstrate the properties of our unified p astro in some illustrative cases.

IV. TOY MODEL
In this section we illustrate and test the unified p astro method using a simple toy model.In Sec.V, we will then apply the method on real GW data, albeit in a simplified analysis.
The toy model data is generated in the form of segments, each of which consists of four data points with values drawn from a standard normal distribution.A segment might also contain a signal with a probability proportional to Λ s .Signals are simulated by adding a value λ s to one of the data points in the segment.We assume that λ s is known.The goal of an FGMC-type analysis is to estimate the fraction Λ s .Figure 1 demonstrates a schematic of the data generated in the toy model.
We construct four simple pipelines to analyze this data on a segment-by-segment basis.The pipelines are all based on a maximum-likelihood estimate assuming Gaussian noise statistics, and the pipelines all assume that the noise is drawn from a standard normal with σ = 1.For each segment, the pipelines calculate the noise likelihood and the signal likelihood, assuming the signal value λ s is known where x i are the analyzed data points in the segment.The pipelines might use different number of data points, corresponding to the number k ≤ 4. Each pipeline calculates a p value for every segment (under the noise hypothesis) which is the main input for the unified p astro analysis.The four pipelines are 1.Pipeline 1 : This pipeline makes no assumption on which of the four data points in a segment has the signal, effectively marginalizing over the position of the signal.Therefore, k = 4 in Eq. ( 16) and Eq. ( 17).
2. Pipeline 2 : This pipeline still assumes Gaussian statistics but only considers the data point with the loudest value of the four and calculates its likelihood.The is equivalent to setting k = 1 in Eq. ( 16) and Eq. ( 17), but replacing x i with max{x i }.This pipeline will be more effective at detecting a loud signal but will generally overestimate the number of signals.
3. Pipeline 3 : This only uses the first two data points of a segment.Therefore, k = 2 in Eq. ( 16) and Eq.(17).Since in our toy model, we know that each data point is equally probable to contain a signal, we expect that this pipeline will witness only about half of the total signals, and consequently its count parameter will be around half the true value.Thereby, this is analogous to a search pipeline that will be sensitive to only a subset of GW signals (e.g., those from high-mass systems).
4. Pipeline 4 : Similar to Pipeline 3 except it only uses the last two data points.Therefore Pipeline 4 is statistically independent from Pipeline 3 while still seeing only half of the total signals.Combining this pipeline with Pipeline 3 is the simplest example of a correlation between pipelines.
For each pipeline, we also estimate the single-pipeline signal and noise counts using the signal and noise likelihoods.
Using the p value from the pipelines as their detection statistics, we will calculate the joint likelihoods using a KDE as described in Sec.III.To do this we create a number of segments containing a signal to train a signal KDE, and similarly train a noise KDE using segments that do not contain a signal.Then for a simulation where Λ s is unknown, we can score the outputs of the pipelines against the KDEs to perform a unified FGMC analysis using Eq. ( 14), and measure the noise and signal counts, Λ n and Λ s .
The algorithmic choices made by the pipelines mean that their results differ from each other, and can introduce a bias in their estimate of signal counts (and thereby p astro ).For instance, consider Pipeline 3, which can only see about half of the total signals.When combining the output of the pipelines, the KDEs do not know a priori about these biases and will have to learn it from the triggers.In a joint analysis between Pipeline 3 and Pipeline 4, the signal KDE can learn that only about half the analyzed segments will have a low p value in Pipeline 3 (i.e., when a signal is added to the first two data points) and that these are the segments in which Pipeline 4 will likely give a high p value (and vice versa).Figure 2 shows an FGMC analysis for two separate toy model analyses.In both Figs.2a and 2b, we separately simulated 1000 segments of data with about a quarter (Λ s = 250) of the segments having a signal with λ s = 3. Figure 2a shows the signal and noise counts estimated by Pipeline 1 and Pipeline 2, along with a unified fit.As expected, Pipeline 2 shows a bias in its recovery which nevertheless the joint analysis can correct for.Figure 2b shows the results of a toy-model analysis with Pipeline 3 and Pipeline 4. Both pipelines only see about half of the simulated signals, by design.However, the joint analysis learns this bias through the signal KDE and can account for it.
This section shows, using a simple example, how outputs from multiple pipelines can be combined using the unified formalism that we have developed.It also shows how certain kinds of biases can be corrected when multiple pipelines are joined together, by drawing information from the correlations (or lack thereof) between the outputs of the pipelines.These examples demonstrate the robustness of the unified p astro method.
V. APPLICATION TO GWTC-2.1 We now test the unified p astro method with real GW triggers (henceforth on-source triggers) by applying this to O3a results from GWTC-2.1 [21].GWTC-2.1 applies a preliminary cut of FAR ≤ 2 day −1 , which we also adopt.While this analysis uses real data, it is only intended to be illustrative, and we adopt some simplifications for computational simplicity.
We use results from the GstLAL [26,27,29] and PyCBC [32][33][34] pipelines, and restrict ourselves to triggers that are at least Hanford-Livingston coincident (i.e., present in data from LIGO Hanford and Livingston at a minimum).GWTC-2.1 [21] uses two versions of the Py-CBC pipelines; referred to as PyCBC-broad and Py-CBC-BBH; in this analysis we only use the former version, which we will simply refer to as PyCBC. 1 While our formalism is general, we focus solely on CBC signals; a choice that is primarily driven by prudence, as we have only seen the population of CBC signals so far [2].Furthermore, for the sake of simplicity we will only use BBH signals for the training; this will also serve as a test to see if the method can distinguish sufficiently between signal types and to test overfitting.These simplifications make it easier to illustrate the behavior of the unified p astro , and highlight the key considerations for an analysis to be performed for future GW catalogs.
We use pipeline settings that are as similar as possible to the runs in GWTC-2.1 [21].A crucial factor to consider is that the search pipelines have to be run in a way that we can establish a one-to-one correspondence between their triggers so that we can learn the correlations between them.This is necessary even in the case where one pipeline registers a trigger and the other does not because the absence of the trigger is itself useful information.While one-to-one correspondence is straightforward for simulated signal triggers (described in Sec.V A), it can be somewhat ambiguous to define in the case of noise triggers (implementation details are explained in Sec.V B).
In order to calculate a unified p astro , Eq. ( 14), and Eq. ( 15) require us to construct the joint distribution of a statistic from each pipeline.We choose to use the FAR  both GstLAL and PyCBC.The stars are the O3a on-source triggers from GWTC-2.1 found by both pipelines [19].The upper cutoff of the PyCBC distribution that is visible in the plot comes from the fact that the pipeline places a limit on the FAR based on the number of time slides performed.This translates to a cutoff at β ≈ 15.76 and results in a bump in the PyCBC distribution at these values of β.FIG.3: Joint likelihood distributions for the signal and noise hypothesis.The spans of the signal and noise distributions span are dissimilar.The signal distribution extends to much larger β values than the noise distribution.
The contours correspond to 50% and 90% levels in both plots.as the base statistic here as it is easily interpretable, commonly used by all pipelines and informative of the relative significance of triggers.However, the dynamic range of FARs can be large, and we therefore define a statistic β that is related to the logarithm of the inverse FAR,

GstLAL Ø 10
where κ is a scaling constant, which we set to a value of κ = 100 yr −1 .This number is set to give a good dynamic range to β but our results are insensitive to its actual value.Higher values of β correspond to more significant candidates.
We construct three cases each for the noise and signal hypotheses, corresponding to triggers that are registered (i) only by GstLAL, (ii) only by PyCBC, and (iii) by both pipelines.We find that separate modeling the three cases is necessary in order to achieve a faithful fit of signal triggers, which can cover a wide range on the β space and show significant structure.In addition, this allows us to consider the case where one pipeline would not see a signal but the other does.For example, in the GWTC-2.1 analysis [19], PyCBC did not assign FARs to events which triggered in a single detector (although this is now possible [54]), but GstLAL did.In the future, a more sophisticated fitting method like a convolutional neural network or a random forest classifier might make multiple separate fits unnecessary.
The distributions are usually normalized such that the total probability over its parameter space is one.Therefore, when considering multiple separate cases we need to renormalize them to account for the relative probability of the class they represent.We do this by multi-plying the output of the distribution with the fraction of signal or noise triggers that fall in that particular class.We find that getting a joint trigger is ≈ 104 times more likely under the signal hypothesis than the noise hypothesis.Meanwhile, a GstLAL-only (PyCBC-only) trigger is ≈ 2.77 (≈ 3.87) times more likely under the noise hypothesis than the signal hypothesis.The numbers quoted correspond to triggers that pass a 2 day −1 FAR threshold.

A. Signal distribution
The signal distributed is generated by fitting simulated software signals or injections that are added to the detector noise, to a KDE.We use the same simulated signals as from the GWTC-3 analysis [55], whose distribution is described in detail in Appendix C.7 (the injection row of Table X) of the GWTC-3 paper [2].In particular, the source masses are distributed as The redshift distribution is flat in comoving volume, with a maximum redshift of 1.9, assuming a ΛCDM cosmology [56].We restrict ourselves to injections with sourceframe component masses 3M ⊙ -100M ⊙ to focus solely on BBHs.The simulated population is not a close match to the inferred astrophysical population [15,57].However, it shares broad characteristics and is simple to use.Therefore it suffices for our illustrative calculation.Using a population model that more closely resembles the underlying population should enable more accurate calculation of the true alarm rate, and hence p astro .In any KDE-based analysis, the choice of the KDE bandwidth is important, especially for multidimensional distributions with intricate shapes.In our illustrative analysis, the bandwidth was picked manually by checking that the KDE approximates well the distribution of injections.We verified that the one-dimensional cumulative distributions of the KDE and the injection set match, and also cross-checked the KDE against a KDE using the Silverman rule of thumb [58].Ultimately, a realistic application would need a more sophisticated fitting scheme, for example, KDEs fit using an iterative approach [59] or a machine-learning approach such as using normalizing flows.
The top plot of Fig. 3a shows the joint distribution of signals that are found by both PyCBC and GstLAL, and Fig, 4 shows the distribution of signal triggers that are found only by one pipeline.

B. Noise distribution
The noise distribution is fit using data with a nonphysical time shift higher than the light travel time between detectors to remove correlations between any real signals.Generating this data is computationally expensive, therefore, in this proof-of-concept analysis, we use O3a data with a single time shift where LIGO Livingston and Virgo data streams were shifted by 0.62831 and 0.31415 s, respectively, in relation to LIGO Hanford.
To define corresponding noise triggers between the pipelines, we match the times of the time-shifted triggers within a window of 1 s, comparable to the typical window used for grouping triggers in O3 [2].This procedure gives us 256 GstLAL triggers and 348 PyCBC triggers that pass the FAR threshold of 2 day −1 ; of these, only 12 triggers are common between pipelines.For a unified p astro analysis intended to be used in production of a GW catalog, a larger set of noise triggers would be required.
The small number of noise triggers common between the pipelines makes it difficult to accurately reconstruct the distribution.However, as our primary goal is to illustrate unified p astro method, rather than to produce a full set of results, these triggers are sufficient for completing a demonstrative calculation.To make use of the small number of triggers that are common between the pipelines, we adopt a bootstrapping procedure to generate more triggers.To do this, we first fit triggers that pass a 10 day −1 cut to a KDE, and draw 10 4 triggers that pass a 2 day −1 cut.This corresponds to doing ≈ 17 time slides in total.To avoid overfitting the noise triggers, we choose to employ a truncated Gaussian fit to them, instead of a KDE.The assumption of a Gaussian is purely for simplicity, and a more careful choice of estimating the noise distribution would be needed for a proper analysis.The single-pipeline noise triggers are fit using a truncated Gaussian with mean µ = β(2 day −1 ) and a standard deviation calculated from the noise triggers.The joint noise distribution is fit with a 2-dimensional truncated Gaussian with µ = β(2 day −1 ) and the covariance matrix calculated from the noise triggers.Figure 3b shows the distribution of joint triggers, while Fig. 4 shows the single-pipeline triggers in this bootstrapped data set.
The sparseness of noise triggers means that the noise distribution is not accurately reconstructed.This is liable to give biased values for p astro : where the noise likelihood is underestimated, p astro will be underestimated, and where the noise likelihood is overestimated, p astro will be underestimated.However, the goal of this application is to show that the unified p astro formalism yields sensible results to illustrate what would be needed for future analyses.A realistic application of this process to GW posterior corresponds to 90% uncertainty levels, while the contours in the two-dimensional posteriors are the 50% and 90% levels.We recover a median value Λ s = 53 +10 −13 .data will likely require multiple time shifts or some other process to generate a larger number of high-fidelity noise triggers, and then a careful reconstruction of the shape of the noise distribution.

C. Computing a unified pastro
After restricting ourselves to ones that are at least Hanford-Livingston coincident, and have been found by at least one of GstLAL or PyCBC, we are left with 553 on-source triggers from the GWTC-2.1 O3a results [2].Using the signal and noise models described above, we can now proceed to calculate a unified p astro for these triggers.
First, through Eq. ( 10), we estimate Λ s and Λ n for the joint analysis.We do this using the Markov-chain Monte Carlo sampler emcee [60].We use uniform priors on the count parameters, Λ s ∈ [0, 1000] and Λ n ∈ [0, 1000].The triggers are then scored against the appropriate distribution to calculate p(x i |S) and p(x i |∅).
Figure 5 shows the posterior distribution of Λ s and Λ n , with a median Λ s of 53.Table I lists the 41 triggers that have unified p astro ≥ 0.5.Most triggers are reported by both GstLAL and PyCBC, but we do have a few triggers that are detected by GstLAL alone, which corresponds to the high β tail of the on-source triggers in the top panel of Fig. 4.This is qualitatively consistent with the 44 triggers found with a p astro ≥ 0.5 by at least one pipeline in GWTC-2.1 analysis [2].
In Table I, we recover 26 triggers that are also reported with p astro > 0.5 in at least one search pipeline in the

TABLE I:
Triggers with a unified p astro ≥ 0.5 from our illustrative analysis.The triggers that have p astro ≥ 0.5 in at least one pipeline in GWTC-2.1 [19] are shown in the second column.Also listed are the FARs of the triggers from the GstLAL and PyCBC pipelines.The GW name of the triggers also recovered with p astro ≥ 0.5 by at least one pipeline in GWTC-2.1 is given in the second column.These results illustrate the properties of the unified p astro method, but a larger number of noise triggers, and more accurate population models, would be needed to obtain reliable quantitative results.
GWTC-2.1 analysis [2] (the corresponding GW identification of these triggers has been given for easy identification) while the remaining 15 triggers were below this threshold.Inspecting the β values of these triggers show that they are usually low, and comparing with the noise distributions, it is plausible that they could be consistent with noise if some of the modeling assumptions are relaxed.The simplified modeling of the noise distribution is expected to lead to the promotion of some noise triggers while suppressing some real signals.
In the case of joint triggers there is a heavy weight in favor of signal triggers as discussed in Sec.V.The exact quantitative results might be susceptible to small-number statistics, and it is plausible that some of these would be down weighted if we had a more complete noise distribution from more time slides.However, the results show the expected behavior that having multiple pipelines find a candidate when their noise distributions are largely uncorrelated, increases our certainty that a candidate is real.
In Table II, we show the list of triggers with a unified p astro < 0.5, but with a p astro ≥ 0.5 in either GstLAL or PyCBC in GWTC-2.1 [19].While the correlation between pipelines increases p astro for some candidates, the unified analysis also down-weights certain triggers, in particular those with highly asymmetric β values.This is because the joint simulated-signal trigger distribution has a strong correlation between the β (and thereby the FAR) values of two pipelines; this is expected since both are matched filter pipelines using CBC templates.For example, GW190803 022701 has GstLAL and PyCBC FARs of 7.3 × 10 −2 yr −1 and 81 yr −1 , respectively.Therefore, while the GstLAL significance of the trigger is high, the asymmetric FARs in the two pipelines gives it a lower weight in the unified analysis.In our case, the low p astro can potentially be explained by the choice of a Gaussian for the noise distribution with heavy tails at large β.The uncertainties involved in the noise and the signal distributions in this illustrative analysis, and the fact that we do not include other search pipelines, mean that we should not rule out the triggers in Table II.Nevertheless, the results demonstrate that there is information that is uniquely captured by a joint analysis of multiple pipelines.In practice, we would expect that any highsignificance candidate that is a clear outlier in the signal distribution would warrant further investigation to understand why the pipelines differ in their response.
Finally, many of the triggers in Table I have a p astro close to 1.This is likely an overestimate, due to the sparseness of the noise distribution which can lead to an artificially low noise likelihood in Eq. (15).We equally expect that some of the low-probability candidates, not shown in the table, have underestimated p astro for the same reason.The specific values we get are also dependent on the strong modeling assumptions made for the noise distribution.A more realistic analysis would probably require adequate non-parametric modeling of the noise distribution such that any structure in the parameter space can be identified.While it is worth bearing these limitations in mind, both Table I and Fig. 5 qualitatively demonstrate consistency of the results with GWTC-2.1, indicating that even with simpler modeling choices this method can yield sensible overall results.

VI. DISCUSSION AND CONCLUSION
We have developed a statistical formalism for combining information from multiple search pipelines to calculate a unified p astro .We first demonstrated this formalism using a simple toy model, showing that it can consistently combine information and even account for biases in search pipelines.We then applied the framework to O3a data, using triggers from GstLAL and PyCBC, demonstrating how correlations between pipelines may update our understanding of candidates, but highlighting the importance of using accurate models for the signal and noise populations.
Currently, multiple search pipelines are used to identify interesting candidates, each with its own strengths.A unified p astro can potentially reduce confusion in the interpretation of triggers found by different pipelines and can help in using marginal triggers for subsequent GW analyses.We have shown that certain types of correlations between pipelines can inform the significance that the unified analysis assigns to a trigger.Similarly, combining the search pipelines' results mitigates the need to calculate an effective trials factor to correct the individualpipeline FARs to account for repeated analysis of the same data.Calculating this trials factor is generally nontrivial.Our formalism can naturally and consistently account for these considerations since it calculates how triggers in one pipeline are correlated with those in another pipeline.
While we have demonstrated that this method can estimate signal and noise counts, and a unified p astro in a sensible and consistent manner, the paucity of noise triggers is a clear computational bottleneck.The scarcity of these triggers can be understood by considering the FAR thresholds used.The 2 day −1 threshold used means that we will be limited to a few hundred noise triggers in each pipeline per a six month run.Therefore, depending on the structure in the joint noise correlations, we would probably need O(10-100) time slides to obtain an accurate representation of the noise distribution.The number of noise triggers that are common between pipelines  will be only a fraction of the total number of noise triggers, and if the noise backgrounds are distinct, then there will only be a small fraction in common.This may indicate a need for multiple coordinated time shifts between pipelines to build up the noise distribution.One can also consider other methods of building noise distributions such as modeling the joint noise distribution and drawing from it, similar in philosophy to the GstLAL pipeline [47].However, while the low number of common noise triggers does pose a difficulty for reconstructing the shape of the noise distribution, it also highlights the importance of considering the number of candidates in common between pipelines: if common noise triggers are rare, but common signal triggers are frequent, then a candidate being found by multiple pipelines should increase its p astro .We already see this in the analysis done here where a joint trigger is about 100 times more likely under the signal hypothesis.
If suitable data products (injection sets and noise triggers) are available, this formalism can be readily applied to include other search pipelines, notably those used by the LVK such as MBTA [36,37] and cWB [39][40][41], as well as pipelines developed for external analysis of public data [61].This would enable the construction of a single GW catalog accounting for all analyses.A strength of this formalism when constructing joint catalogs is that it would differentially weigh search pipelines, depending on their precision and accuracy by taking into account the shape of the joint simulated-signal distribution.Hence, as long as the simulated-signal distribution sufficiently reflects the true distribution of GW sources, it should be possible to correct for pipelines missing signals over some region of parameter space, or producing spurious triggers in another.
The addition of cWB (and other minimally modeled pipelines [62,63]) to catalog production could be extremely useful, as it can be sensitive in regions of the CBC parameter space where modeled searches do not perform well, such as eccentric binaries [64], in addition to non-CBC transient sources.The sensitivity to non-CBC sources is a complication for current analyses, as the p astro calculations are done assuming only CBC sources (discussed in Appendix F of the GWTC-3 paper [18]).Non-CBC source have a lower true alarm rate than CBC sources, and hence a lower p astro at a given FAR; consequently, misidentifying a trigger as CBC will lead to an overestimate of its p astro .To mitigate this, LVK analyses have imposed an additional criterion for cWB candidates, that they must have a counterpart trigger from a template-based CBC pipeline [2,14].This is effectively an approximation to the unified p astro framework, acknowledging that for real CBC signals there would probably be correlation between pipeline results.This could be put on a more rigorous basis by explicitly using out unified p astro framework, considering the response of the pipelines to simulated CBC and non-CBC signals.
Addition of more pipelines can make the problem of noise triggers more acute, as the number of triggers that are common between three or more pipelines might be small and extrapolation difficult.However, it is probable that noise triggers in a multiple-pipeline space would be so rare that we can conservatively set the noise likelihood to a small limiting value for such a case.
In this paper, we have only considered the simplest case of calculating a unified p astro .In a future work, we will extend it to implement binning in the mass space to estimate the astrophysical probability that a system is a BBH, a BNS or a neutron star-black hole binary (NSBH) [25].Such an extension is in principle straightforward, where in addition to a statistic like the FAR we also fit distribution of recovered template chirp masses for simulated BBH, BNS and NSBH signal.On-source triggers can then be scored against such two-dimensional distributions to give the corresponding likelihood.Searches are often optimized in different ways giving them differential sensitivity in specific parts of the mass space (e.g., PyCBC has an analysis tuned to BBHs [18,49]).Calculating unified probabilities in conjugation with mass binning will allow us to fold in this differential sensitivity in a consistent, unbiased way in a way akin to the toy model example with Pipeline 3 and Pipeline 4 in Sec.IV.
A key uncertainty in the calculation of p astro is the form of the underlying source population.Errors in the assumed population translate to a misestimation of the true alarm rate, and hence p astro .A way to mitigate this would be to infer the population simultaneously while calculating p astro [44,45,65].This additionally enables lower-significance candidates to be used in population inference.Using a unified p astro makes it easier to assess the contamination fraction in a catalog based upon several pipelines, and hence would make it more convenient to perform such joint inference in the future.
While we have only considered the application of our method to final search analyses performed offline, an extension to low-latency detections is also theoretically possible.By combining multiple pipelines we depend less on one pipeline and hence should be less susceptible to incorrect alerts and retractions.In combination with mass binning, this could be extremely useful for electromagnetic follow-up of GW candidates [66].Since many such followups are often target-of-opportunity observations, improving the reliability of trigger information can be valuable in evaluating the proper usage of scarce telescope time.The relative scarcity of noise triggers could be a bigger computational issue in low latency, as the joint noise distributions will have to be continuously reevaluated at a reasonable cadence in order to account for the changing detector state [22,23].Therefore, further work would be needed to identify methods that could potentially be used to reconstruct the noise distribution in low latency.
Finally, Bayesian frameworks have been developed to assess the probability of multimessenger detections [67][68][69][70].These calculate the probabilities associated with candidates being background noise or astrophysical signals from different sources or the same source.Our unified p astro framework naturally feeds into these calculations.Furthermore, our framework would enable the extension of these calculations to consider how a nondetection in one messenger impacts the probability that a candidate in a counterpart messenger is real.Given the rich science that may result from a multimessenger discovery [6,71,72], this may be a valuable avenue of future investigation.
The data used in this paper for the analysis of GWTC-2.1 triggers is available as a Zenodo repository [73].

FIG. 1 :
FIG.1:A schematic of the toy model data.Each row is a segment, consisting of four data points indicated by the rounded rectangles.The noise in the segments is drawn from a standard normal distribution, with its value indicated by the color scale of the rounded rectangle.Segments 2 and 4 from the top in this example contain a signal, added randomly to one of the four data points indicated by the vermilion star.

FIG. 2 :
FIG.2: Toy-model exploration of the unified p astro formalism.A signal value, λ s = 3 was used in both cases.In both figures, the solid vertical and horizontal lines indicate the real number of segments with (and without) a signal.The shaded regions in the marginal plots are 90% uncertainty levels while the two-dimensional contours correspond to 50% and 90% levels.
The signal distribution p(⃗ x|S) learned using a KDE to the simulated signals that are commonly found by PyCBC β (b) The noise distribution p(⃗ x|∅) fit to a 2-dimensional Gaussian to the common noise triggers.The stars are the O3a on-source triggers from GWTC-2.1 found by both pipelines[19].

FIG. 4 :
FIG. 4:The distribution of triggers that were found by one pipeline only.The top plot shows the noise and signal distribution of GstLAL, while the bottom plot shows the PyCBC distribution of triggers.The black stars are again the on-source O3a triggers from GWTC-2.1 that have only been found by the corresponding pipeline[19].

FIG. 5 :
FIG. 5:The posterior for the signal and noise counts in the joint analysis.The shaded region in the one-dimensional posterior corresponds to 90% uncertainty levels, while the contours in the two-dimensional posteriors are the 50% and 90% levels.We recover a median value Λ s = 53 +10 −13 .

TABLE II :
[19]gers with a unified p astro < 0.5 in our illustrative analysis, but with a p astro ≥ 0.5 in either GstLAL or PyCBC in GWTC-2.1[19]